Abstract



In this paper, we consider the "foreach" sparse recovery problem with failure probability $p$, the goal
of which is to design a distribution over $m \times N$ matrices $\Phi$ and a decoding algorithm $A$ such that for
every $x \in \mathbb{R}^N$, we have the following error guarantee with probability at least $1 - p$:

$$\|x - A(\Phi x)\|_2 \le C\,\|x - x_k\|_2,$$

where $C$ is a constant (ideally arbitrarily close to 1) and $x_k$ is the best $k$-sparse approximation of $x$.

Much of the sparse recovery or compressive sensing literature has focused on the case of either $p = 0$
or $p = \Omega(1)$. We initiate the study of this problem for the entire range of failure probability. Our two
main results are as follows:



1. We prove a lower bound on $m$, the number of measurements, of $\Omega(k \log(N/k) + \log(1/p))$ for
$2^{-\Theta(N)} \le p < 1$. Cohen, Dahmen, and DeVore [5] prove that this bound is tight.

2. We prove nearly matching upper bounds for sub-linear time decoding. Previous such results
addressed only $p = \Omega(1)$.

Our results and techniques lead to the following corollaries: (i) the first ever sub-linear time decoding
$\ell_1/\ell_1$ "forall" sparse recovery system that requires a $\log^{\gamma} N$ extra factor (for some $\gamma < 1$) over the
optimal $O(k \log(N/k))$ number of measurements, and (ii) extensions of Gilbert et al. [7] results for
information-theoretically bounded adversaries.

Cohen, Dahmen, and DeVore [6] prove an $\Omega(N)$ lower bound for the "forall" case (i.e. $p = 0$). Our
lower bound technique is inspired by their geometric approach, and is thus very different from prior
lower bound proofs using communication complexity. Our technique yields a stronger result which
holds for the entire range of failure probability $p$, and also provides a simpler, more intuitive proof of
the original result by Cohen et al. For the upper bounds, we provide several algorithms that span the
trade-offs between number of measurements and failure probability. These algorithms include several
innovations that may be of use in similar applications. They include a new combination of the
recursive constructive sparse recovery technique of Porat and Strauss [22] and efficiently decodable list
recoverable codes. These list recoverable codes focus on a different range of parameters than the ones
considered in traditional coding theory. Our best parameters are obtained by considering a code defined
by the algorithmic version of the Loomis-Whitney inequality due to Ngo, Porat, Re and Rudra.

1 Introduction

In a large number of modern scientific and computational applications, we have considerably more data
than we can hope to process efficiently and more data than is essential for distilling useful information.
Sparse signal recovery [8] is one method for both reducing the amount of data we collect or process initially
and then, from the reduced collection of observations, recovering (an approximation to) the key pieces of
information in the data. Sparse recovery assumes the following mathematical model: a data point is a vector
$x \in \mathbb{R}^N$; using a matrix $\Phi$ of size $m \times N$, where $m \ll N$, we collect "measurements" of $x$ non-adaptively
and linearly as $\Phi x$; then, using a "recovery algorithm" $A$, we return a good approximation to $x$. The error
guarantee must satisfy

$$\|x - A(\Phi x)\|_2 \le C\,\|x - x_k\|_2, \qquad (1)$$

where $C$ is a constant (ideally arbitrarily close to 1) and $x_k$ is the best $k$-sparse approximation of $x$. This
is customarily called an $\ell_2/\ell_2$-error guarantee in the literature. This paper considers the sparse recovery
problem with failure probability $p$, the goal of which is to design a distribution over $m \times N$ matrices $\Phi$ and a
decoding algorithm $A$ such that for every $x \in \mathbb{R}^N$, the error guarantee holds with probability at least $1 - p$.
The reader is referred to [8] and the references therein for a survey of sparse matrix techniques for sparse
recovery, and to [2] for a collection of articles (and the references therein) that emphasize the applications
of sparse recovery in signal and image processing.
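
As a concrete reading of guarantee (1), the following minimal sketch computes the best $k$-sparse approximation $x_k$ and checks whether a candidate output satisfies the $\ell_2/\ell_2$ bound. The function names, the example vectors, and the small numerical tolerance are illustrative assumptions and do not come from the paper; indices are 0-based.

```python
import numpy as np

def best_k_sparse(x, k):
    """Return x_k: keep the k largest-magnitude entries of x, zero the rest."""
    x = np.asarray(x, dtype=float)
    xk = np.zeros_like(x)
    if k > 0:
        top = np.argsort(np.abs(x))[-k:]   # indices of the k largest |x_i|
        xk[top] = x[top]
    return xk

def satisfies_l2l2(x, x_hat, k, C):
    """Check ||x - x_hat||_2 <= C * ||x - x_k||_2, i.e. guarantee (1)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    tail = np.linalg.norm(x - best_k_sparse(x, k))
    return bool(np.linalg.norm(x - x_hat) <= C * tail + 1e-12)

# Hypothetical signal with two dominant coordinates and a candidate output.
x = np.array([10.0, -7.0, 0.5, 0.1, -0.2, 0.05])
x_hat = np.array([9.8, -7.1, 0.0, 0.0, 0.0, 0.0])
```

In this example the candidate output passes the check for $C = 1.1$ but not for $C = 1$, which illustrates why the approximation factor is only required to be close to 1 rather than equal to it.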

There are many parameters of interest in the design problem: (i) number of measurements $m$; (ii) decoding
time, i.e. runtime of algorithm $A$; (iii) approximation factor $C$; and (iv) failure probability $p$. We would
like to minimize all four parameters simultaneously. It turns out, however, that optimizing the failure
probability $p$ can lead to wildly different recovery schemes. Much of the sparse recovery or compressive
sensing literature has focused on the case of either $p = 0$ (which is called the "forall" model) or $p = \Omega(1)$
(the "foreach" model). Cohen, Dahmen, and DeVore [6] showed a lower bound of $m = \Omega(N)$ for the
number of measurements when $p = 0$, rendering a sparse recovery system useless as one must collect
(asymptotically) as many measurements as the length of the original signal.$^1$ Thus, algorithmically there is
not much to do in this regime.

The case of $p \ge \Omega(1)$ has resulted in much more algorithmic success. Candes and Tao showed
that $O(k \log(N/k))$ random measurements with a polynomial time recovery algorithm are sufficient for
compressible vectors, and Cohen et al. [6] show that $O(k \log(N/k))$ measurements are sufficient for any
vector (but the recovery algorithm given is not polynomial time). In a subsequent paper, Cohen et al.
give a polynomial time algorithm with $O(k \log(N/k))$ measurements. The next goal was to match the
$O(k \log(N/k))$ measurements but with sub-linear time decoding. This was achieved by Gilbert, Li, Porat,
and Strauss [9], who showed that there is a distribution on $m \times N$ matrices with $m = O(k \log(N/k))$ and
a decoding algorithm $A$ such that, for each $x \in \mathbb{R}^N$, the $\ell_2/\ell_2$-error guarantee is satisfied with failure
probability $p = \Omega(1)$. The next natural goal was to nail down the correct dependence on $C = 1 + \epsilon$. Gilbert et
al.'s result actually needs $O(\frac{1}{\epsilon} k \log(N/k))$ measurements. This was then shown to be tight by Price and
Woodruff [23].

At this point, we completely understand the problem for the case of $p = 0$ or $p = \Omega(1)$. Somewhat
surprisingly, there is no work that has explicitly considered the $\ell_2/\ell_2$ sparse recovery problem when
$0 < p \le o(1)$. The main goal of this paper is to close this gap in our understanding.

Given the importance of the sparse recovery problem, we believe that it is important to close the gap. Similar
studies have been done extensively in a closely related field: coding theory. While the model of worst-case
errors pioneered by Hamming (which corresponds to the forall model) and the oblivious/stochastic error
model pioneered by Shannon (which corresponds to the foreach model) are most well-known, there is a
rich set of results in trying to understand the power of intermediate channels, including the arbitrarily
varying channel [15]. Another way to consider intermediate channels is to consider computationally bounded
adversaries [11]. Gilbert et al. [7] considered a computationally bounded adversarial model for the sparse
recovery problem in which signals are generated neither obliviously (as in the foreach model) nor
adversarially (as in the forall model) in order to interpolate between the forall and foreach signal models. Our results
in this paper imply new results for the $\ell_1/\ell_1$ sparse recovery problem as well as the $\ell_2/\ell_2$ sparse recovery
problem against bounded adversaries.

1 For this reason, all of the forall sparse signal recovery results satisfy a different, weaker error guarantee. E.g. in the $\ell_1/\ell_1$
forall sparse recovery we replace the condition (1) by $\|x - A(\Phi x)\|_1 \le C\,\|x - x_k\|_1$.

Our main contributions are as follows. 

1. We prove that the number of measurements has to be $\Omega(k \log(N/k) + \log(1/p))$ for $2^{-\Theta(N)} \le p < 1$.

2. We prove nearly matching upper bounds for sub-linear time decoding.

3. We present applications of our result to obtain

(i) the best known number of measurements for $\ell_1/\ell_1$ forall sparse recovery with sublinear ($\mathrm{poly}(k, \log N)$)
time decoding, and

(ii) nearly tight upper and lower bounds on the number of measurements needed to perform $\ell_2/\ell_2$
sparse recovery against an information-theoretically bounded adversary.

As was mentioned earlier, there are many parameters one could optimize. We will not pay very close attention
to the approximation factor $C$, other than to stipulate that $C \le O(1)$. In most of our upper bounds, we can
handle $C = 1 + \epsilon$ for an arbitrary constant $\epsilon$, but optimizing the dependence on $\epsilon$ is beyond the scope of
this paper.

Lower Bound Result. We prove a lower bound of $\Omega(\log(1/p))$ on the number of measurements when
the failure probability satisfies $2^{-\Theta(N)} \le p < 1$. (When $p \le 2^{-\Omega(N)}$, our results imply a tight bound
of $m = \Omega(N)$.) The $\Omega(\log(1/p))$ lower bound along with the lower bound of $\Omega(k \log(N/k))$ from [23]
implies the final form of the lower bound claimed above. The obvious follow-up question is whether this
bound is tight. Indeed, an upper bound result of Cohen, Dahmen, and DeVore [5] proves that this bound is
tight if we only care about polynomial time decoding (see Theorem G.1). Thus, the interesting algorithmic
question is how close we can get to this bound with sub-linear time decoding.

Upper Bound Results. For the upper bounds, we provide several algorithms that span the trade-offs
between number of measurements and failure probability. For completeness, we include the running times and
the space requirements of the algorithms and measurement matrices in Table 1, which summarizes our main
results and compares them with existing results.

We begin by first considering the most natural way to boost the failure probability of a given $\ell_2/\ell_2$ sparse
recovery problem: we repeat the scheme $s$ times with independent randomness and pick the "best" answer;
see Appendix A for more details. This boosts the decoding error probability from the original $p$ to $p^{\Omega(s)}$,
though the reduction does blow up the approximation factor by a multiplicative factor of $\sqrt{3}$.

The above implies that if we can optimally solve the $\ell_2/\ell_2$ sparse recovery problem for $p \ge (N/k)^{-k}$
(i.e. with $O(k \log(N/k))$ measurements), then we can solve the problem optimally for smaller $p$. Hence,
for the rest of the description we focus on the case $p \ge (N/k)^{-O(k)}$ (where the goal is to obtain $m =
O(k \log(N/k))$). Note that in this case, the amplification does not help, as even for $p = \Omega(1)$, previous
results (e.g. [23]) imply that $m \ge \Omega(k \log(N/k))$. Thus, if the original decoding error probability is $p$, then
obtaining the $(N/k)^{-k}$ decoding error probability by amplification makes the number of measurements larger
than the optimal value by a factor of $k \log(N/k) / \log(1/p)$. As we will see shortly, the best known upper bound
can achieve $p = 2^{-k}$, which implies that amplification will be larger than the optimal value of $\Omega(k \log(N/k))$
by a factor of $\log(N/k)$. In this work, we show how to achieve the same goal with an asymptotically smaller
blow-up.

Table 1: Summary of algorithmic results. The results in [22] are for $\ell_1/\ell_1$ forall sparse recovery but their
results can be easily adapted to our setting with our proofs. $c$ is some constant $\ge 8$, $\alpha > 0$ is an arbitrary
constant, and we ignore the constant factors in front of all the expressions.
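
The amplification arithmetic discussed above can be made concrete with a short calculation. The sketch below is only illustrative: it assumes, idealizing the $p^{\Omega(s)}$ bound, that $s$ independent repetitions fail together with probability about $p^s$, it ignores all constant factors, and the function name and its parameters are hypothetical.

```python
import math

def amplification_overhead(N, k, log2_inv_p):
    """Repetitions and measurement blow-up when a scheme with failure
    probability p (passed as log2(1/p)) is repeated until the failure
    probability drops to (N/k)^(-k).

    Assumption: s repetitions fail together with probability about p**s,
    so we need s * log2(1/p) >= k * log2(N/k). Constants are ignored."""
    target_bits = k * math.log2(N / k)      # log2 of 1 / (N/k)^(-k)
    blowup = target_bits / log2_inv_p       # k log(N/k) / log(1/p)
    return math.ceil(blowup), blowup

# Starting from p = 2^(-k), the blow-up is log2(N/k).
N, k = 2**20, 16
reps, blowup = amplification_overhead(N, k, log2_inv_p=k)
```

With $N = 2^{20}$ and $k = 16$, starting from $p = 2^{-k}$ gives a blow-up of $\log_2(N/k) = 16$, while starting from a constant failure probability such as $p = 1/4$ gives a blow-up proportional to $k \log(N/k)$.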

For $p \ge (N/k)^{-k}$, there are two related works. The first is that of Porat and Strauss [22], who considered
the sparse recovery problem under the $\ell_1/\ell_1$ forall guarantee. Despite the different error guarantee, our
construction is closely related to that of [22] and our proofs imply the results for [22] listed in Table 1. Note
that the results for polynomially large $k$ are pretty much the same except we have a better space complexity.
For general $k$, our result also has better number of measurements and failure probability guarantee. The
second work is that of Gilbert et al. [9]. Even though the results in that paper are cited for $p \ge \Omega(1)$, it
can be shown that if one uses $O(k)$-wise independent random variables instead of the pair-wise independent
random variables as used in [9], one can obtain a "weak system" with failure probability $2^{-k}$. Then our
"weak system to top level system conversion" (Lemma 4.12) leads to the result claimed in the second to last
row in Table 1. Our results have a better failure probability at the cost of larger number of measurements.

It is natural to ask whether decreasing the failure probability (the base changed from 2 to $N/k$) is worth
giving up the optimality in the number of measurements (which is what [9] obtains). Note that achieving
a failure probability of $(N/k)^{-k}$ is a very natural goal and our results are better than those in [9] when we
anchor on the failure probability goal first. Interestingly, it turns out that this difference is also crucial to our
result for $\ell_1/\ell_1$ sparse recovery, which is discussed next.



$\ell_1/\ell_1$ forall sparse recovery. As was mentioned earlier, our construction is similar to that of Porat and
Strauss. In fact, our techniques also imply some results for the $\ell_1/\ell_1$ forall sparse recovery problem. (We
only need the "weak system" construction; after that, results from [22] can be applied.) We highlight two
results here.

First, for $k \ge N^{\Omega(1)}$ we can get $O(k \log(N/k))$ measurements with $k^{1+\alpha}\,\mathrm{poly}\log N$ decoding time and
space usage, for any $\alpha > 0$. This improves upon the $\ell_1/\ell_1$ forall sparse recovery result from [22] with a
better space usage and answers a question left open by [22].

Our second result, which is for general $k$, achieves for any $\alpha > 0$, $O(k \log(N/k) \log_k^{\alpha} N)$ number of
measurements with decoding time $k^{2} \cdot \mathrm{poly}\log N$. This greatly improves upon the extra factor in the
number of measurements over the optimal $O(k \log(N/k))$ from something around $\log_k^{c} N$ in [22] to $\log_k^{\alpha} N$
for arbitrary $\alpha > 0$. Even though this "only" shaves off log factors, no prior work on sublinear time
decodable $\ell_1/\ell_1$ sparse recovery schemes has been able to breach the "log barrier." There is a folklore
construction for a sub-linear time decodable matrix: using a "bit-tester matrix" along with a lossless bipartite
expander where each edge in its adjacency matrix is replaced with a random $\pm 1$ value. This scheme achieves
$m = O(k \log(N/k) \log N)$. If we consider a "weak system" only and small values of $k$, then the folklore
construction achieves failure probability $p = (N/k)^{-k}$, which matches that of our construction. This along
with a simple union bound (which we also use in our construction) and results in [22] implies an $\ell_1/\ell_1$
sparse recovery system. However, this simple construction has an extra factor of $\log N$ in the number of
measurements, which our result reduces to $\log_k^{\alpha} N$ for any $\alpha > 0$.$^2$

Bounded Adversary Results. We also obtain some results for $\ell_2/\ell_2$ sparse recovery against
information-theoretically bounded adversaries as considered by Gilbert et al. [7]. (See Section 2 for a formal definition
of such bounded adversaries.) Gilbert et al. show that $O(k \log(N/k))$ measurements are sufficient for such
adversaries with $O(\log N)$ bits of information. Our results allow us to prove results for a general bound
of $s$ on the number of information bits. In particular, we observe that for such adversaries $O(k \log(N/k) + s)$
measurements suffice. Further, if one desires sublinear time decoding then our results in Table 1 allow
for a similar conclusion but with extra $\mathrm{poly}\log k$ factors. We also observe that one needs $\Omega(\sqrt{s})$ many
measurements against such an adversary (assuming the entries are polynomially large). In the final version
of the paper, we will present a proof suggested to us by an anonymous reviewer that leads to the optimal
$\Omega(s)$ lower bound.

Lower Bound Techniques. Our lower bound technique is inspired by the geometric approach of Cohen et
al. [6] for the $p = 0$ case. Our bound holds for the entire range of failure probability $p$. Our technique also
yields a simpler and more intuitive proof of Cohen et al.'s result. Both results hold even for sparsity $k = 1$.

The technical crux of the lower bound result in [6] for $p = 0$ is to show that any measurement matrix $\Phi$ with
$O(N/C^2)$ rows has a null space vector $n$ that is "non-flat," i.e. $n$ has one coordinate that has most of the
mass of $n$. On the other hand, since $\Phi n = 0$, the decoding algorithm $A$ has to output the same answer when
$x = n$ and when $x = 0$. It is easy to see that then $A$ does not satisfy (1) for at least one of these two cases
(the output for $0$ has to be $0$ while the output for $n$ has to be non-flat and in particular not $0$).

To briefly introduce our technique, consider the case of $p = 2^{-N}$ (where we want a lower bound of $m =
\Omega(N)$). The straightforward extension of Cohen et al.'s argument is to define a distribution over, say,
all the unit vectors in $\mathbb{R}^N$, argue that this gives a large measure of "bad vectors," and then apply Yao's
minimax lemma to obtain our final result; i.e., that there are "a lot" of non-flat vectors in the null space of a
given matrix $\Phi$. This argument fails because the distribution on the bad vectors must be independent of the
measurement matrix $\Phi$ (and algorithm $A$) in order to apply Yao's lemma, but null space vectors, of course,
depend on $\Phi$. On the other hand, if we define the "hard" distribution to be the uniform distribution, then the
measure of null space vectors for any $m \ge 1$ is zero, and thus this obvious generalization does not work.

We overcome this obstacle with a simple idea. Our hard distribution is still the uniform distribution on the
unit sphere $S^{N-1}$. We first show that there is a region $R$ on this sphere with large measure ($\ge p$) such that
all vectors in $R$ have a positive "spike" (large mass) at one particular coordinate $j^* \in [N]$. (The region $R$ is
simply a small spherical cap above the unit vector $e_{j^*}$.) In particular, to recover an input vector $v \in R$, the
algorithm has to assign a large positive mass to the $j^*$-th coordinate of $A(\Phi v)$. Next, by applying a certain
invertible linear reflector to $R$, we can construct a region $R'$ (which is also a region on the sphere, and is
just a reflection of $R$) with the same measure satisfying the following: for each vector $v \in R$, the reflection
$v'$ of $v$ ($v' \in R'$) has a negative spike at the same coordinate $j^*$; furthermore, $\Phi v = \Phi v'$, which forces the
algorithm $A$ into a dichotomy. The algorithm cannot recover both $v$ and $v'$ well at once. Roughly speaking,
the algorithm will be wrong with probability at least half the total measure of $R$ and $R'$, which is $p$. Finally,
Yao's lemma completes the lower bound proof. There are some additional technical obstacles that we need
to overcome in this step; see Section F.2 for more details.

2 Interestingly, one could use the weak system from [9] (with $O(k)$-wise independence) to obtain an $\ell_1/\ell_1$ sparse recovery
system with an extra factor of $O(\log(N/k))$, which is not that much better than that obtained by the simple bit-tester construction.

In the final version of the paper, we will present an alternate lower bound proof that was suggested to us by 
an anonymous reviewer. 
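
The reflection step of the lower bound argument can be illustrated numerically. The sketch below is an assumption-laden toy and not necessarily the reflector used in the proof: it takes a random Gaussian matrix as a stand-in for $\Phi$, uses a Householder reflection along a null-space vector $n$ of $\Phi$, and picks $v = e_{j^*}$ (the center of the spherical cap). The sizes, the seed, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, j = 5, 50, 7                     # illustrative sizes; j plays the role of j*
Phi = rng.standard_normal((m, N))      # stand-in measurement matrix

# Orthogonal projector onto the null space of Phi.
P_null = np.eye(N) - np.linalg.pinv(Phi) @ Phi
n = P_null[:, j]                       # null-space component of the unit vector e_j

v = np.zeros(N)
v[j] = 1.0                             # unit vector with a positive spike at j

# Householder reflection of v across the hyperplane orthogonal to n.
# Since Phi @ n = 0, the reflected vector has the same measurements as v.
v_ref = v - 2.0 * (n @ v) / (n @ n) * n
```

Because the reflection moves $v$ only along a null-space direction, $\Phi v = \Phi v'$ while the norm is preserved, and when $m \ll N$ the spike at coordinate $j^*$ flips sign. No decoder can then be accurate on both vectors, which is the dichotomy used above.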

Upper Bound Techniques. We believe that our main algorithmic contributions are the new techniques
that we introduce in this paper, which should be useful in (similar) applications (e.g. in the $\ell_1/\ell_1$ sparse
recovery problem as we have already pointed out).

Our upper bounds follow the same outline used by Gilbert, Li, Porat and Strauss [9] and Porat and
Strauss [22]. At a high level, the construction follows three steps. The first step is to design an
"identification scheme," which in sub-linear time computes a set $S \subseteq [N]$ of size roughly $k$ that contains $\Omega(k)$
of the "heavy hitters." (Heavy hitters are the coordinates where if the output vector does not put in enough
mass then (1) will not be satisfied.) In the second step, we develop a "weak level system" which
essentially estimates the values of coordinates in $S$. Finally, using a loop invariant iterative scheme, we convert
the weak system into a "top level system," which is the overall system that we want to design. (The way
this iterative procedure works is that it makes sure that after iteration $i$, one is missing only $O(k/2^i)$ heavy
hitters, so after $\log k$ steps we would have recovered all of them.) The last two steps are designed to run in
time $|S| \cdot \mathrm{poly}\log N$, so if the first step runs in sub-linear time, then the overall procedure is sub-linear.$^3$
Our main contribution is in the first step, so we will focus on the identification part here. The second step
(taking median of measurements like Count-Sketch [4]) is standard [13]. For a pictorial overview of our
identification scheme, see Figure 1.

In order to highlight and to summarize our technical contributions, we present an overview of the scheme
in [22] (when adapted to the $\ell_2/\ell_2$ sparse recovery problem). We focus only on the identification step.
For near-linear time identification, one uses a lossless bipartite expander where each edge in the adjacency
matrix is replaced by a random $\pm 1$ value. The intuition is that because of the expansion property most heavy
hitters will not collide with another heavy hitter in most of the measurements they participate in. Further, the
expansion property implies that the $\ell_1$ noise in most of the neighboring measurements will be low. (The
random $\pm 1$ is a standard trick to convert this to a low $\ell_2$ noise.) Thus, if we define the value of an index
to be the median value of all its measurements, then we should get very good estimates for most of the heavy
hitters (and in particular, we can identify them by outputting the top $O(k)$ median values). Since this step
implies computing $N$ medians overall, we have a near-linear time computation. However, note that if we had
access to a subset $S' \subseteq [N]$ that had most of the heavy hitters in it, we could get away with a run time nearly
linear in $|S'|$ (by just computing the medians in $S'$).
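
The median-based, near-linear time identification described above can be sketched as follows. This is a simplified illustration rather than the construction of [22]: it replaces the lossless expander by independent random hashing (in the style of Count-Sketch), which is an assumption made only to keep the example short, and the function name, parameters, and test signal are hypothetical. Indices are 0-based.

```python
import numpy as np

def sketch_and_identify(x, k, buckets, reps, seed=0):
    """Measure x with `reps` random +-1 hashed sketches of `buckets` rows each,
    then estimate every coordinate by a median and return the top-k indices."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = x.size
    h = rng.integers(0, buckets, size=(reps, N))   # bucket of index i in repetition t
    s = rng.choice([-1.0, 1.0], size=(reps, N))    # random signs on the edges
    # Linear measurements: y[t, b] = sum of s[t, i] * x[i] over i with h[t, i] == b.
    y = np.stack([np.bincount(h[t], weights=s[t] * x, minlength=buckets)
                  for t in range(reps)])
    # Value of index i: median over repetitions of its signed measurement.
    est = np.median(s * np.take_along_axis(y, h, axis=1), axis=0)
    top = np.argsort(np.abs(est))[-k:]             # N medians, so near-linear time
    return {int(i) for i in top}, est

# Hypothetical signal: three heavy hitters over a small flat tail.
x = np.full(1000, 0.01)
heavy = {3: 100.0, 500: -80.0, 999: 60.0}
for i, val in heavy.items():
    x[i] = val
found, est = sketch_and_identify(x, k=3, buckets=256, reps=11)
```

The decoder here touches all $N$ indices, which is exactly the cost the recursive scheme avoids by restricting the medians to a small candidate set $S'$.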

This seems like a chicken and egg problem as the set $S'$ is what we were after to begin with! Porat and
Strauss use recursion to compute $S'$ in sub-linear time. (The scheme was also subsequently used by Ngo,
Porat and Rudra to design near optimal sub-linear time decodable $\ell_1/\ell_1$ forall sparse recovery schemes
for non-negative signals [21].) To give the main intuition, consider the scheme that results in $O(\sqrt{N})$
identification time. We think of the domain $[N]$ as $L \times R$, where both $L$ and $R$ are isomorphic to $[\sqrt{N}]$.
(Think of $L$ as the first $\frac{\log N}{2}$ bits in the $\log N$-bit representation of any index in $[N]$ and $R$ as the
remaining bits.) If one can obtain lists $S_L \subseteq L$ and $S_R \subseteq R$ that contain the projections of the heavy hitters
in $L$ and $R$, respectively, then $S_L \times S_R$ will contain all the heavy hitters, i.e. $S' \subseteq S_L \times S_R$. (We can use the
near-linear time scheme to obtain $S_L$ and $S_R$ in $O(\sqrt{N})$ time in the base case. One also has to make sure
that when going from $[N]$ to a domain of size $\sqrt{N}$, not too many heavy hitters collide. This can be done
by, say, randomly permuting $[N]$ before applying the recursive scheme.) The simplest thing to do would be
to set $S' = S_L \times S_R$. However, since both $|S_L|$ and $|S_R|$ can be $\Omega(k)$, this step itself will take $\Omega(k^2)$ time,
which is too much if we are shooting for a decoding time of $k^{1+\alpha}\,\mathrm{poly}\log N$ for $\alpha < 1$. The way Porat and
Strauss solved this problem was to store the whole inversion map as a table. This allowed $k \cdot \mathrm{poly}\log N$
decoding time but the scheme ended up needing $\Omega(N)$ space overall.

3 We would also like to point out that Gilbert et al.'s construction has a failure probability of $\Omega(1)$ in the very first iteration of the
last step (weak to top level system conversion) and it seems unlikely that this can be made smaller without significantly changing
their scheme.
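
The $L \times R$ splitting of the index domain can be sketched in a few lines. The helper names below are hypothetical, indices are 0-based bit strings rather than elements of $\{1, \dots, N\}$, and the lists $S_L$ and $S_R$ are simply given rather than computed by a recursive call.

```python
from itertools import product

def split_index(j, half_bits):
    """Split an index into (left, right): its high and low `half_bits` bits."""
    return j >> half_bits, j & ((1 << half_bits) - 1)

def join_index(left, right, half_bits):
    """Inverse of split_index."""
    return (left << half_bits) | right

def candidates(S_L, S_R, half_bits):
    """All indices whose projections lie in S_L and S_R, i.e. S_L x S_R."""
    return {join_index(l, r, half_bits) for l, r in product(S_L, S_R)}

# Toy domain with N = 256 (8-bit indices) and two heavy hitters.
half_bits = 4
heavy = {0x3A, 0xC5}
S_L = {split_index(j, half_bits)[0] for j in heavy}
S_R = {split_index(j, half_bits)[1] for j in heavy}
S_prime = candidates(S_L, S_R, half_bits)
```

In this toy example the two heavy hitters produce four candidates, which shows the $|S_L| \cdot |S_R|$ blow-up that makes the naive product too expensive when both lists have size $\Omega(k)$.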

To get a running time of $k^{1+\alpha}\,\mathrm{poly}\log N$ one needs to apply the recursive idea with more levels. One can
think of the whole procedure as a recursion tree with $\mathcal{N} = O(\log_k N)$ nodes. Unfortunately, this process
introduces another technical hurdle. At each node, the expander based scheme loses some, say $\zeta$, fraction
of the heavy hitters. To bound the overall fraction of lost heavy hitters, Porat and Strauss use the naive
union bound of $\zeta \cdot \mathcal{N}$. However, we need the overall fraction of lost heavy hitters to be $O(1)$. This in
turn introduces extra factors of $\log_k N$ in the number of measurements (resulting in the ultimate number of
measurements of $k \log(N/k) \log_k^{c} N$ in [22]).

We are now ready to present the new ideas that improve upon Porat and Strauss' solutions to solve the two
issues raised above. Instead of dividing $[N]$ into $[\sqrt{N}] \times [\sqrt{N}]$, we first apply a code $C : [N] \to [\sqrt[b]{N}]^r$.
(Note that the Porat-Strauss construction corresponds to the case when $r = b = 2$ and $C$ just "splits"
the $\log N$ bits into two equal parts.) Thus, in our recursive algorithm at the root we will get $r$ subsets
$S_1, \dots, S_r \subseteq [\sqrt[b]{N}]$ with the guarantee that for (most) $i \in [r]$, $S_i$ contains $C(j)_i$ for most heavy hitters
$j$. Thus, we need to recover the $j$'s for which the condition in the last sentence is true. This is exactly
the list recovery problem that has been studied in the coding theory literature. (See e.g. [24].) Thus, if we
can design a code $C$ that solves the list recovery problem very efficiently, we would solve the first problem
above.$^4$ For the second problem, note that since we are using a code $C$, even if we only have $C(j)_i \in S_i$
for, say, $r/2$ positions $i \in [r]$, we can recover all such indices $j$. In other words, unlike in the Porat-Strauss
construction where we can lose a heavy hitter even if we lose it in any one of the $\mathcal{N}$ recursive calls, in our case
we only lose a heavy hitter if it is lost in multiple recursive calls. This fact allows us to do a better union
bound than the naive one used in [22].



The question then is whether there exists a code $C$ with the desired properties. The most crucial part is that the
code needs to have a decoding algorithm whose running time is (near) linear in $\max_{i \in [r]} |S_i|$. Further, we
need such codes with $r = O(1)$, i.e. of constant block length independent of $\max_{i \in [r]} |S_i|$. Unfortunately,
the known results on list recovery, be it for Reed-Solomon codes [12] or folded Reed-Solomon codes [10],
do not work well in this regime: these results need $r \ge \Omega(\max_{i \in [r]} |S_i|)$, which is way too expensive. For
our setting, the best we can do with Reed-Solomon list recovery is to do the naive thing of going through
all possibilities in $\times_{i \in [r]} S_i$. (These codes however can correct for optimal number of errors and lead to our
result in the last row of Table 1.) Fortunately, a recent result of Ngo, Porat, Re and Rudra [20] gave an
algorithmic proof of the Loomis-Whitney inequality [18]. The (combinatorial) Loomis-Whitney inequality
has found uses in theoretical computer science before [14, 16]. In this work, we present the first application
of the algorithmic Loomis-Whitney inequality of [20] and show that it naturally defines a code $C$ with the
required (algorithmic) list recoverability. This code leads to the result in the second to last row of Table 1.
Interestingly, we get optimal weak level systems by this method. We lose in the final failure probability
because of the weak level to top level system conversion.

We conclude the contribution overview by pointing out three technical aspects of our results.

• As was mentioned earlier, we first randomly permute the columns of the matrix to make the recursion
work. To complete our identification algorithm, we need to perform the inverse operation on the
indices to be output. The naive way would be to use a table lookup, which will require $O(N \log N)$
space, but would still be an improvement over [22]. However, we are able to exploit the specific nature
of the recursive tree and the fact that our main results use the Reed-Solomon code and the code based
on Loomis-Whitney inequality to have sub-linear space usage.

• In the weak level to top level system, both Gilbert et al., and Porat and Strauss decrease the parameters
geometrically; however, in our case, we need to use different decay functions to obtain our failure
probability.

• Unlike the argument in [22], we explicitly use an expander while Porat and Strauss used a random
graph. However, because of this, [22] need at least $N$-wise independence in their random variables
to make their argument go through. Our use of expanders allows us to get away with using only
$O(k)$-wise independence, which among others leads to our better space usage.

4 We would like to point out that [21] also uses list recoverable codes but those codes are used in a different context: they used
it to replace expanders and further, the codes have the traditional parameters.



2 Preliminaries 

We fix notations, terminology, and concepts that will be used throughout the paper. 

Let $[N]$ denote the set $\{1, \dots, N\}$. Let $G : [N] \times [\ell] \to [M]$ be an $\ell$-regular bipartite graph, and $M_G$ be its
adjacency matrix. We will often switch back and forth between the graph $G$ and the matrix $M_G$.

For any subset $S \subseteq [N]$, let $\Gamma(S) \subseteq [M]$ denote the set of neighbor vertices of $S$ in $G$. Further, let $E(S)$
denote the set of edges incident on $S$. A bipartite graph $G : [N] \times [\ell] \to [M]$ is a $(t, \epsilon)$-expander if for every
subset $S \subseteq [N]$ of $|S| \le t$, we have $|\Gamma(S)| \ge |S|\,\ell\,(1 - \epsilon)$. Several expander properties used in our proofs
are listed in Appendix B. Appendix C has some probability basics.
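
For intuition about the expander definition, the following brute-force check verifies the expansion condition on a tiny graph. It is a sketch only: it assumes the condition reads $|\Gamma(S)| \ge (1 - \epsilon)\,\ell\,|S|$ for all non-empty $S$ with $|S| \le t$, it enumerates all such subsets (so it is exponential in $t$), and the function name and the example graph are hypothetical. Vertices are 0-based.

```python
from itertools import combinations

def is_expander(neighbors, t, eps):
    """neighbors[i] lists the l right-neighbours of left vertex i.
    Return True iff every S with 1 <= |S| <= t satisfies
    |Gamma(S)| >= (1 - eps) * l * |S|."""
    l = len(neighbors[0])
    for size in range(1, t + 1):
        for S in combinations(range(len(neighbors)), size):
            gamma = set().union(*(neighbors[i] for i in S))
            if len(gamma) < (1 - eps) * l * size:
                return False
    return True

# Tiny 2-regular example: left vertex 3 shares one neighbour with vertices 0 and 1.
neighbors = [[0, 1], [2, 3], [4, 5], [0, 2]]
```

On this example every pair of left vertices has at least 3 distinct neighbours out of a possible 4, so the graph passes the check for $t = 2$, $\epsilon = 0.25$ and fails it for $\epsilon = 0.1$.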

Sparse Recovery Basics. For a vector $x = (x_i)_{i=1}^N \in \mathbb{R}^N$, the set of $k$ highest-magnitude coordinates
of $x$ is denoted by $H_k(x)$. Such elements are called heavy hitters. Every element $i \in [N] \setminus H_k(x)$ whose
magnitude $|x_i|$ is at least a constant (depending on $\zeta$ and $\eta$) times $\|z\|_2 / \sqrt{k}$ will be called a heavy tail
element. Here, $\zeta$ and $\eta$ are constants that will be clear
from context. All the remaining indices will be called light tail elements; let $\mathcal{L}$ denote the set of light tail
elements. A vector $w = (w_i)_{i=1}^N \in \mathbb{R}^N$ is called a flat tail if $w_i = 1/|S|$ for every non-zero $w_i$, where
$S = \mathrm{supp}(w)$.

Definition 2.1. A probabilistic $m \times N$ matrix $\mathcal{M}$ is called a $(k, C)$-approximate sparse recovery system or
$(k, C)$-top level system with failure probability $p$ if there exists a decoding algorithm $A$ such that for every
$x \in \mathbb{R}^N$, the following holds with probability at least $1 - p$:

$$\|x - A(\mathcal{M}x)\|_2 \le C \cdot \|x - x_{H_k(x)}\|_2.$$

The parameter $m$ is called the number of measurements of the system.

We will also consider $(k, C)$ $\ell_1/\ell_1$ top level systems, which are the same as above except we have $p = 0$
and the $\ell_2$ norms are replaced by $\ell_1$ norms.

Definition 2.2. A probabilistic matrix $\mathcal{M}$ with $N$ columns is called a $(k, \zeta, \eta)$-weak identification matrix
with $(\ell, p)$-guarantee if there is an algorithm that, given $\mathcal{M}x$ and a subset $S \subseteq [N]$, with probability at least
$1 - p$ outputs a subset $I \subseteq S$ such that (i) $|I| \le \ell$ and (ii) at most $\zeta k$ of the elements of $H_k(x)$ are not
present in $I$. The time taken to compute $I$ will be called identification time.



Definition 2.3. We will call a (random) $m \times N$ matrix $\mathcal{M}$ a $(k, \zeta, \eta)$ weak $\ell_2/\ell_2$ system if the following
holds for any vector $x = y + z$ such that $|\mathrm{supp}(y)| \le k$. Given $\mathcal{M}x$ one can compute $\hat{x}$ such that there exist
$\hat{y}, \hat{z}$ that satisfy the following properties: (1) $x = \hat{x} + \hat{y} + \hat{z}$; (2) $|\mathrm{supp}(\hat{x})| \le O(k/\eta)$; (3) $|\mathrm{supp}(\hat{y})| \le \zeta k$;
(4) $\|\hat{z}\|_2 \le (1 + O(\eta)) \cdot \|z\|_2$.

We will also consider $(k, \zeta, \eta)$ weak $\ell_1/\ell_1$ systems, which are the same as above except the $\ell_2$ norms are
replaced by $\ell_1$ norms (and the algorithm is deterministic).

Coding Basics. In this section, we define and instantiate some (families of) codes that we will be interested
in. We begin with some basic coding definitions. We will call a code $C : [N] \to [q]^r$ an $(r, N)_q$-code.$^5$
Vectors in the range of $C$ are called codewords. Sometimes we will think of $C$ as a subset of $[q]^r$, defined
the natural way. We will primarily be interested in list recoverable codes. In particular,

Definition 2.4. Let $N, q, r, \ell, L \ge 1$ be integers and $0 \le \rho \le 1$ be a real number. Then an $(r, N)_q$-code $C$ is called a $(\rho, \ell, L)$-list recoverable code if the following holds. Given any collection of subsets $S_1, \dots, S_r \subseteq [q]$ such that $|S_i| \le \ell$ for every $i \in [r]$, there exist at most $L$ codewords $(c_1, \dots, c_r) \in C$ such that $c_i \in S_i$ for at least $(1 - \rho) r$ indices $i \in [r]$. Further, we will call such a code recoverable in time $T(\ell, N, q)$ if all such codewords can be computed within this time upper bound.

An $(r, N)_q$-code is said to be uniform if for every $i \in [r]$ it is the case that $C(x)_i$ is uniformly distributed over $[q]$ for uniformly random $x \in [N]$. In our construction we will require codes that are both list recoverable and uniform. Neither of these concepts is new, but our construction needs us to focus on parameter regimes that are generally not the object of study in coding theory. In particular, as in coding theory, we focus on the case where $N$ is increasing. We also focus on the case when $q$ grows with $N$, which is also a well studied regime. However, we consider the case when $r$ is fixed.

We now consider a code based on the Loomis-Whitney inequality [18]. Let $d \ge 2$ be an integer and assume that $N$ is a power of 2 and $\sqrt[d]{N}$ is an integer. Given $x \in [N]$ we think of it as $x = (x_1, \dots, x_d) \in [\sqrt[d]{N}]^d$. Further, for any $i \in [d]$ define $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d) \in [\sqrt[d]{N}]^{d-1}$. Then define $C_{LW(d)}(x) = (x_{-1}, \dots, x_{-d})$. The Loomis-Whitney inequality implies that $C_{LW(d)}$ is a $(0, \ell, \ell^{d/(d-1)})$-list recoverable code. Ngo, Porat, Ré and Rudra [20] recently showed that the code is list-recoverable in time $O(\ell^{d/(d-1)})$. The result implies the following:

Lemma 2.5. The code $C_{LW(d)}$ is a uniform code that is a $(0, \ell, \ell^{d/(d-1)})$-list recoverable code. Further, it is recoverable in $O(\ell^{d/(d-1)} \log N)$ time.
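To make the Loomis-Whitney code concrete, here is a small Python sketch of the encoding map $x \mapsto (x_{-1}, \dots, x_{-d})$ together with a naive list-recovery routine for the $\rho = 0$ case. The recovery loop is a brute-force illustration written for this example; it is not the algorithm of [20] and does not achieve the running time stated in Lemma 2.5. The names `lw_encode` and `lw_list_recover` are hypothetical.

```python
def lw_encode(x):
    """Loomis-Whitney codeword of a d-tuple x: the d projections x_{-i}."""
    return tuple(x[:i] + x[i + 1:] for i in range(len(x)))

def lw_list_recover(sets):
    """All d-tuples x with x_{-i} in sets[i] for every i (rho = 0, d >= 2).

    Naive search: extend each element of S_1 (which fixes x_2..x_d) by every
    candidate first coordinate that appears in S_2, then check all projections.
    """
    d = len(sets)
    first_coords = {s[0] for s in sets[1]}   # x_{-2} starts with x_1
    out = []
    for rest in sets[0]:                     # rest = x_{-1} = (x_2, ..., x_d)
        for a in first_coords:
            x = (a,) + rest
            if all(x[:i] + x[i + 1:] in sets[i] for i in range(d)):
                out.append(x)
    return sorted(out)

messages = [(0, 1, 2), (3, 1, 2), (0, 2, 2)]          # d = 3
sets = [{lw_encode(x)[i] for x in messages} for i in range(3)]
ell = max(len(s) for s in sets)
recovered = lw_list_recover(sets)
print(recovered, len(recovered) <= ell ** (3 / 2))
```

Every encoded message is consistent with the sets by construction, so it must appear in the output; the Loomis-Whitney inequality caps the output size at $\ell^{d/(d-1)}$.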



For the sake of completeness, we prove the above via Theorem E.1 (with a slightly different proof than the one from [20]). Finally, we consider the well-known Reed-Solomon (RS) codes.

Lemma 2.6. Let $\rho < 1/2(1 - b/r)$. Then the code $C_{RS}$ is a uniform code that is a $(\rho, \ell, \ell^r)$-list recoverable code. Further, it is recoverable in $O(\ell^r r^2 \log^2 N)$ time.$^{7}$

5 This part is different from the weak system in [22], where we have $|\mathrm{supp}(\hat{x})| \le O(k)$.

6 We depart from the standard convention and use the size of the code $N$ instead of its dimension $\log N$: this makes expressions simpler later on.

7 The $r^2 \log^2 N$ factor follows from the fact that the Berlekamp-Massey algorithm needs $O(r^2)$ operations over $\mathbb{F}_q$, each of which takes $O(\log^2 q)$ time.



Bounded Adversary Model. We summarize the relevant definitions of (computationally-)bounded adversaries from [7]. In this setting, Mallory is the name of the process that generates inputs $x$ to the sparse recovery problem. We recall two definitions for Mallory:

• Oblivious: Mallory cannot see the matrix $\Phi$ and generates the signal $x$ independently of $\Phi$. For sparse signal recovery, this model is equivalent to the "foreach" signal model.

• Information-Theoretic: Mallory's output has bounded mutual information with the matrix. To cast this in a computational light, we say that an algorithm $M$ is ($s$-)information-theoretically-bounded if $M(x) = M_2(M_1(x))$, where the output of $M_1$ consists of at most $s$ bits. This model is similar to that of the "information bottleneck" [27].



Lemma 1 of [7] relates the information-theoretically bounded adversary to a bound on the success probability of an oblivious adversary. We re-state the lemma for completeness:

Lemma 2.7. Pick $\ell = \ell(N)$, and fix $0 < \alpha < 1$. Let $\mathcal{A}$ be any randomized algorithm which takes input $x \in \{0,1\}^N$, $r \in \{0,1\}^m$, and "succeeds" with probability $1 - \beta$. Then for any information-theoretically bounded algorithm $M$ with space $\ell$, $\mathcal{A}(M(r), r)$ succeeds with probability at least

$$\min\{1 - \alpha,\; 1 - \ell/\log(\alpha/\beta)\}$$

over the choices of $r$.



3 Lower bounds

3.1 Lower bound for $\ell_2/\ell_2$-foreach sparse recovery with low risk

Throughout this section, let $\Phi$ denote an $m \times N$ measurement matrix. Without loss of generality, we will assume that all our measurement matrices have full row rank: $\mathrm{rank}(\Phi) = m$. Let $\mathcal{N}$ be the (row) null-space of $\Phi$, and $P$ be the matrix representing the orthoprojection onto $\mathcal{N}$. For $j \in [N]$, let $e_j$ denote the $j$th standard basis vector. Note that $P = P^2 = P^T P$ because any orthogonal projection matrix is symmetric and idempotent. Hence, for any two vectors $x, y \in \mathbb{R}^N$,

$$\langle Px, y \rangle = \langle x, Py \rangle = \langle x, P^T P y \rangle = \langle Px, Py \rangle.$$
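The projection identities above can be checked numerically. The sketch below assumes the simplest possible case, a single measurement ($m = 1$, $\Phi = a^T$), for which the projector has the closed form $P = I - aa^T/\|a\|_2^2$; this special case and the helper names are choices made for the example only.

```python
def null_space_projector(a):
    """P = I - a a^T / ||a||^2: orthoprojection onto {x : <a, x> = 0} (m = 1)."""
    n = len(a)
    s = sum(t * t for t in a)
    return [[(1.0 if i == j else 0.0) - a[i] * a[j] / s for j in range(n)]
            for i in range(n)]

def matvec(P, x):
    return [sum(p * t for p, t in zip(row, x)) for row in P]

def dot(x, y):
    return sum(s * t for s, t in zip(x, y))

a = [1.0, 2.0, -1.0, 0.5]          # the single row of Phi
P = null_space_projector(a)
x = [0.3, -1.0, 2.0, 4.0]
y = [1.5, 0.0, -2.0, 1.0]
Px, Py = matvec(P, x), matvec(P, y)
print(dot(Px, y), dot(x, Py), dot(Px, Py))  # three values agreeing up to rounding
print(dot(a, Px))                           # close to 0: Px is in the null space
```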
For the sake of completeness we present a simplified version of the proof of the $\Omega(N)$ lower bound of Cohen, Dahmen, and DeVore for the $\ell_2/\ell_2$ forall sparse recovery in Appendix F.

We first prove an auxiliary lemma which generalizes Proposition F.1. The lemma implies that, for any measurement matrix $\Phi$, if $m/N$ is "small" then there will be "a lot" of unit vectors that are not "flat," i.e. in these vectors some coordinate $j^*$ has relatively large magnitude compared to all other coordinates.

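Lemma 3.1. Let $\Phi$ be an arbitrary real matrix of dimension $m \times N$. Then, there exists $j^* \in [N]$ such that, for any unit-length vector $v \in \mathbb{R}^N$ we have

$$\langle Pv, e_{j^*} \rangle \ge \langle v, e_{j^*} \rangle - \sqrt{1 - \langle v, e_{j^*} \rangle} \cdot \sqrt{2m/N} - m/N. \qquad (2)$$

Proof. Let $j^*$ be the coordinate for which $\|Pe_{j^*}\|_2^2 \ge 1 - m/N$, guaranteed to exist by Proposition F.1. For any unit vector $v$, we have

$$
\begin{aligned}
\langle Pv, e_{j^*} \rangle &= \langle v, Pe_{j^*} \rangle \\
&= \langle v, e_{j^*} \rangle - \langle v, e_{j^*} - Pe_{j^*} \rangle \\
&= \langle v, e_{j^*} \rangle - \langle v - e_{j^*}, e_{j^*} - Pe_{j^*} \rangle - \langle e_{j^*}, e_{j^*} - Pe_{j^*} \rangle \\
\text{(by Cauchy-Schwarz)} \quad &\ge \langle v, e_{j^*} \rangle - \|v - e_{j^*}\|_2 \cdot \|e_{j^*} - Pe_{j^*}\|_2 - (1 - \langle e_{j^*}, Pe_{j^*} \rangle) \\
&= \langle v, e_{j^*} \rangle - \sqrt{2 - 2\langle v, e_{j^*} \rangle} \cdot \sqrt{1 - \|Pe_{j^*}\|_2^2} - (1 - \|Pe_{j^*}\|_2^2) \\
&\ge \langle v, e_{j^*} \rangle - \sqrt{1 - \langle v, e_{j^*} \rangle} \cdot \sqrt{2m/N} - m/N.
\end{aligned}
$$

□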
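Inequality (2) can be sanity-checked numerically. The following sketch again assumes the single-measurement case $m = 1$ with $\Phi = a^T$, where $\|Pe_j\|_2^2 = 1 - a_j^2/\|a\|_2^2$, so a valid $j^*$ is any index minimizing $a_j^2$. The function name `lemma_3_1_slack`, the test vector, and the random sampling scheme are assumptions of this example, not part of the paper.

```python
import math
import random

def lemma_3_1_slack(a, trials=2000, seed=0):
    """Smallest observed (LHS - RHS) of inequality (2) for Phi = a^T, m = 1."""
    n = len(a)
    s = sum(t * t for t in a)
    # ||P e_j||^2 = 1 - a_j^2 / ||a||^2, so j* minimizes a_j^2.
    j = min(range(n), key=lambda i: a[i] ** 2)
    delta = 1.0 / n                                   # m / N with m = 1
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(t * t for t in g))
        v = [t / norm for t in g]                     # random unit vector
        c = sum(ai * vi for ai, vi in zip(a, v)) / s
        pv_j = v[j] - c * a[j]                        # <Pv, e_{j*}>
        rhs = v[j] - math.sqrt(max(0.0, 1.0 - v[j])) * math.sqrt(2.0 * delta) - delta
        worst = min(worst, pv_j - rhs)
    return worst

slack = lemma_3_1_slack([1.0, 2.0, -1.0, 0.5, 3.0, 1.0])
print(slack >= -1e-12)   # the lemma says the slack is never negative
```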



Next, we show that any unit vector v sufficiently close to ej* can be paired up in a 1-to-l manner (through 
a reflection operator) with a unit vector v' such that the decoding algorithm cannot work well on both v and 
v'. Since the pairing is measure preserving, we can then infer that when m/N is small the algorithm A will 
fail with "high" probability. 

Lemma 3.2. Let $(\Phi, \mathcal{A})$ be a (deterministic) pair of measurement matrix and decoding algorithm, where $\Phi$ is an $m \times N$ matrix. Define $\delta = m/N$, and let $\gamma \ge 0$, $C \ge 1$ be arbitrary constants such that

$$1 - 2\gamma - 2\delta - 2\sqrt{2\gamma\delta} > \frac{C}{\sqrt{1 + C^2}}. \qquad (3)$$

Let $j^* \in [N]$ be the index satisfying (2), guaranteed to exist by Lemma 3.1. Let $v$ be any unit vector such that $\langle v, e_{j^*} \rangle \ge 1 - \gamma$, and $v' = (I - 2P)v$ where $I$ is the identity matrix. Then, the following two conditions cannot be true at the same time:

(a) $\|v - \mathcal{A}(\Phi v)\|_2 \le C \cdot \|v - v_k\|_2$

(b) $\|v' - \mathcal{A}(\Phi v')\|_2 \le C \cdot \|v' - v'_k\|_2$

Proof. Since $Pv$ is in the null space of $\Phi$, we have $\Phi v' = \Phi(v - 2Pv) = \Phi v$. To simplify notation, define

$$z = \mathcal{A}(\Phi v') = \mathcal{A}(\Phi v).$$

Vector $z$ is well-defined because $\mathcal{A}$ is deterministic. Assuming (a) holds, we will show that (b) does not hold.


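Let $j^*$ be the coordinate from Lemma 3.1. From (a), we have

$$
\begin{aligned}
1 - \gamma - \langle z, e_{j^*} \rangle &\le \langle v, e_{j^*} \rangle - \langle z, e_{j^*} \rangle \\
&\le |\langle v - z, e_{j^*} \rangle| \\
&\le \|v - z\|_2 \\
&\le C \cdot \|v - v_k\|_2 \\
&\le C \cdot \sqrt{1 - \langle v, e_{j^*} \rangle^2} \\
&\le C \sqrt{1 - (1 - \gamma)^2}.
\end{aligned}
$$

Hence,

$$\langle z, e_{j^*} \rangle \ge 1 - \gamma - C\sqrt{1 - (1 - \gamma)^2}.$$

In particular, due to (3),

$$\langle z, e_{j^*} \rangle \ge 0.$$

Consequently,

$$\|v' - z\|_2 \ge |\langle v' - z, e_{j^*} \rangle| = |\langle v', e_{j^*} \rangle - \langle z, e_{j^*} \rangle| \ge \langle z, e_{j^*} \rangle - \langle v', e_{j^*} \rangle \ge -\langle v', e_{j^*} \rangle.$$

We next claim that

$$-\langle v', e_{j^*} \rangle > C\sqrt{1 - \langle v', e_{j^*} \rangle^2}. \qquad (4)$$

Before proving the claim, let us first see how it implies that (b) does not hold. To this end, observe that

$$C \cdot \|v' - v'_k\|_2 \le C\sqrt{1 - \langle v', e_{j^*} \rangle^2} < -\langle v', e_{j^*} \rangle \le \|v' - z\|_2.$$

Finally, we prove claim (4), which is equivalent to $\langle v', e_{j^*} \rangle < -\frac{C}{\sqrt{1 + C^2}}$:

$$
\begin{aligned}
\langle v', e_{j^*} \rangle &= \langle (I - 2P)v, e_{j^*} \rangle \\
&= \langle v, e_{j^*} \rangle - 2\langle Pv, e_{j^*} \rangle \\
\text{(Lemma 3.1)} \quad &\le 1 - 2(1 - \gamma - \delta - \sqrt{2\gamma\delta}) \\
&= -1 + 2\gamma + 2\delta + 2\sqrt{2\gamma\delta} \qquad (5) \\
\text{(from (3))} \quad &< -\frac{C}{\sqrt{1 + C^2}}.
\end{aligned}
$$

□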

Lemma 3.3. Let $(\Phi, \mathcal{A})$ be a fixed pair of (deterministic) measurement matrix and decoding algorithm, where $\Phi$ is an $m \times N$ matrix. Define $\delta = m/N$, and let $\gamma > 0$, $C \ge 1$ be arbitrary constants satisfying (3). Suppose we choose input vectors $x$ uniformly on the unit sphere $S^{N-1}$, then

$$\Pr\left[\|x - \mathcal{A}(\Phi x)\|_2 > C\|x - x_k\|_2\right] \ge \sqrt{1/(2\gamma)} \cdot e^{-\frac{N}{2}\ln(2/\gamma)}.$$

Proof. Let $j^* \in [N]$ be the index satisfying (2), guaranteed to exist by Lemma 3.1. The set of vectors $v$ for which $\langle v, e_{j^*} \rangle \ge 1 - \gamma$ is called the $(1 - \gamma)$-cap about $e_{j^*}$ on the sphere $S^{N-1}$. It is known (see, e.g., [1, Lemma 2.3]) that the $(1 - \gamma)$-cap has measure at least

$$\frac{1}{2} \cdot \left(\sqrt{2\gamma}/2\right)^{N-1} = \sqrt{1/(2\gamma)} \cdot e^{-\frac{N}{2}\ln(2/\gamma)}.$$

The mapping $v \mapsto (I - 2P)v$ from Lemma 3.2 is a reflection through the row space of $\Phi$, hence the image of the reflection is another cap on the sphere with exactly the same measure. From (5) and (3), the two caps are disjoint. Lemma 3.2 then completes the proof, because if algorithm $\mathcal{A}$ works well on a vector in one cap, then it will not work well on the vector's reflection. □
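The key property of the reflection used in Lemmas 3.2 and 3.3 is that $v$ and $v' = (I - 2P)v$ produce the same measurements while pointing in nearly opposite directions along $e_{j^*}$. The sketch below illustrates this, again under the assumption of a single measurement row $a$ (so $P = I - aa^T/\|a\|_2^2$); the name `reflect` and the concrete vector are hypothetical.

```python
import math

def dot(x, y):
    return sum(s * t for s, t in zip(x, y))

def reflect(a, v):
    """Return v' = (I - 2P) v, where P projects onto the null space of Phi = a^T."""
    c = dot(a, v) / dot(a, a)
    pv = [vi - c * ai for vi, ai in zip(v, a)]        # P v
    return [vi - 2.0 * p for vi, p in zip(v, pv)]     # v - 2 P v

a = [1.0, 2.0, -1.0, 0.5, 3.0, 1.0]
j_star = min(range(len(a)), key=lambda i: a[i] ** 2)  # maximizes ||P e_j||
v = [0.0] * len(a)
v[j_star] = 1.0                                       # center of the cap: v = e_{j*}
v_ref = reflect(a, v)

print(dot(a, v), dot(a, v_ref))      # the two measurements coincide (up to rounding)
print(math.sqrt(dot(v_ref, v_ref)))  # the reflection preserves the norm
print(v[j_star], v_ref[j_star])      # the j*-coordinate changes sign
```

Because the decoder only sees $\Phi v = \Phi v'$, it must return the same output for both inputs, which is what forces a failure on at least one of them.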



Armed with the lemma above we prove our final lower bound stated below (and proved in Appendix F.2). To do this we need a continuous version of Yao's lemma and a simple padding trick to reduce the case of $p = 2^{-\Theta(N)}$ to larger values of $p$.

I ln(6+8C 2 ) ., 

Theorem 3.4. Let C ^ 1 and p be such that V12 + 16C 2 • e 2 JV ^ p < 1. Then, any £ 2 /h 
foreach sparse recovery scheme using mxN measurement matrices $ with failure probability at most p and 

approximation factor C must have m ^ (6+8C' 2 )l (6+8C 2, > ^ n ( — + ) = ^(log(l/p)) measurements. 
3.2 Lower Bound for Bounded Adversary Model

In this section, we show the following result:

Theorem 3.5. Any $\ell_2/\ell_2$ sparse recovery scheme that uses at most $b$ bits in each entry of $\Phi$ needs at least $\Omega(\sqrt{s/b})$ measurements to be successful against an $s$-information-theoretically-bounded adversary.

The result follows from the $\ell_2/\ell_2$ forall lower bound argument from Corollary F.2 and the following simple observation:

Lemma 3.6. Let $A$ be an $m \times n$ matrix with $m \le n$. Consider the column sub-matrix $A'$ which has the first $n'$ columns of $A$ (for some $m \le n' \le n$). If $\mathbf{n}'$ is in the null space of $A'$, then $\mathbf{n} = (\mathbf{n}', 0_{n - n'})$ (i.e. vector $\mathbf{n}'$ followed by $n - n'$ zeroes) is in the null space of $A$.
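The observation in Lemma 3.6 is elementary, and a tiny numerical instance may help. The matrix and the null vector below are made-up values chosen for this illustration, and the helper names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def pad_with_zeros(vec, n):
    """Extend vec to length n by appending zeros."""
    return list(vec) + [0.0] * (n - len(vec))

A = [[1.0, 2.0, 3.0, 7.0, -1.0],
     [0.0, 1.0, 1.0, 2.0, 5.0]]            # m = 2, n = 5
n_prime = 3
A_sub = [row[:n_prime] for row in A]        # column sub-matrix A' (first n' columns)
null_vec = [-1.0, -1.0, 1.0]                # a vector in the null space of A'
padded = pad_with_zeros(null_vec, len(A[0]))

print(matvec(A_sub, null_vec), matvec(A, padded))
```

The padded coordinates multiply the extra columns of $A$ by zero, so the product is unchanged.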



Proof of Theorem 3.5 For the sake of contradiction assume that there exists an ^2/^2-sparse recovery matrix 
<J> against any s-information-theoretically -bounded adversary that achieves an approximation factor of C and 
has m < * y= measurements. Next we present an s-information-theoretically-bounded adversary that can 
foil such a system. 

Let <£' b e the column sub-matrix of $ that has only the first n! = \fs]b columns of <J>. Then by Proposi- 
tion 



F.l 



there is a unit vector n' that is in the null space of 3>' and satisfies Hn'Hoo ^ 1— i. By Lemma 
n = (n ; , 0) is in the null space of <£. Further, it is easy to verify that n is a unit vector and [|n||oo ^ 1 — 



3.6 

— f- 

Using the proof of Corollary |F2| one can then argue that any recovery algorithm will have to fail on either 
the input or n. 

To complete the proof, we need to argue that the adversary only needs to remember $s$ bits of information about $\Phi$ to compute $\mathbf{n}'$. Indeed, note that we only need at most $m \cdot \sqrt{s/b} \cdot b \le s/C^2$ bits to describe the matrix $\Phi'$, which is enough to compute $\mathbf{n}$. □



4 Sublinear Decoding 



We present known results with polynomial time decoding on the $\ell_2/\ell_2$ sparse recovery problem in Appendix G.1.

Our strategy for designing sub-linear time decodable top level systems will be as follows: we will first 
design weak identification matrices that have sublinear identification time. Then we (in a black-box manner) 
convert such matrices to sub-linear time decodable top level systems. We now present an outline of how we 
implement our strategy. 



In Section 4.4.1 we show how expanders can be used to construct various schemes that will be useful later. In Section 4.4.2 we show how to convert weak identification systems to top level systems. The rest of the technical development (in Sections 4.4.3 and 4.5) is in designing weak identification systems with good parameters.



Our first main result on sub-linear time decodable top levels systems will be: 

Theorem 4.1. Foranyk ^ N^ 1 ' ande,a > 0, there exists a (k, l+e)-top level system with OJe~ 11 k\og(N/k)) 
measurements, failure probability (N/k) ' g and decoding time e~ 4 • k 1+a ■ log ' ' Nr\ This scheme 

uses O e (k ■ log ( ' N) bits of space. 



8 The $O(\cdot)$ notation here hides the dependence on $\alpha$.

In fact, our results also work for $k = N^{o(1)}$ but we then do not get the optimal number of measurements. However, an increase in the decoding time leads to our second main result, which has a near-optimal number of measurements.

Theorem 4.2. For any 1 ^ k ^ N and e,a > 0, there exists a (k, 1 + e)-top level system with 

0(e~ 11 k log(N/k) log^ N) measurements, failure probability (N/k)~ k ' log ' " k and decoding time (k/e)®( 2 '• 



U{£ K log(iV/fcj log fc iv ) measurements, failure probability i 
log ! 1 ) ./Vrl 77h'.s scheme uses £ (k ■ log ^ N) bits of space. 



4.1 Proof of Main Results

Later in this paper, we will prove the following results, which we will use to prove Theorems 4.1 and 4.2.



Lemma 4.3. Let < a, 7, 77 < 1 be real numbers and 1 ^ k ^ ko ^ N be integers. Then there exists an 
O (VV 4 • k ■ log(N/k) ■ (log fco log(7V /fc ) iV) 6+1 °s( 1 /«)/ l°g(i+art x N matrix that is ( fe) 7) rj). wea k iden- 
tification matrix with (0(k/r]),(kolog(N/ko))~ ^ a7 ' ° Sk o io s( N / k o) >\ -guarantee with identification time 
complexity ofO ( 7~ 3 7/~ 3 ~ a • k ■ k$ • log ^ ' N I, and space complexity qf0^ v (klog ^ ' N). 



Lemma 4.4. Let a be a small enough real number. Then for large enough n, there exists a (k, £, r\)-weak 
identification matrix with (0(k/i]), (n/k)~ 2k )-guarantee with 

0(C 7 r A -k-\og{n/k)-{\og k n) a ) 

measurements and a decoding time of 

CV 4 • fe° (2l/Q) • poly(logn) + C V 2 ■ (k/v)° {2l/a) • poly(logn). 



Lemmas 4.3 and 4.4, along with our weak system to top level system conversion (Lemma 4.12 below), then prove Theorems 4.1 and 4.2, respectively.



4.1 



Proof of Theorem 

the existence of a (0{k/rj), 77/2, 77) weak identification matrix with 0(0(k/r)),(N/Ko) 



We first note that applying Lemma 4.3 with 7 = r?/2 and k$ ^ N^i 1 ) implies 



-ar/k 



guarantee 



with O(r]~ 10 klog(N/k)) measurements. By Lemma 4.10 below, we can amplify the failure probability to 
(./V/fco) - by increasing the number of measurements to 0(a _1 ?7 _11 fclog(./V/A;)). This implies that the 

" ... - .-. — • ... , ' : ' we 



4.11 



identification algorithm will identify all but k/2 elements of H k+k / v (x). This means like Lemma 

can convert this into a weak system on which we can then applying the conversion technique of Lemma |4T2 

to get the claimed bounds. (When we're applying the recursive procedure to k/2^ we use this value of k in 

the weak systems and the "original" value of k as fco in the weak system.) For the space requirement, we 

have to add up the space requirement for each weak system, which can be done within the claimed bound. 

□ 



Proof of Theorem 4.2. The proof is almost the same as that for Theorem 4.1, except we use Lemma 4.4 instead of Lemma 4.3. The rest of the proof is exactly the same. (The way Lemma 4.12 is stated it needs
0(k log(N/k)) measurements but it can be verified that the conversion also works with the extra 0(log k N) 
in the number of measurements.) □ 



9 The $O(\cdot)$ notation here hides the dependence on $\alpha$.

4.2 Consequences for the Bounded Adversary Model

Our first corollary is an upper bound for the information-theoretic bounded adversary and follows directly from Lemma 2.7 (by setting $\beta = \alpha 2^{-s/\alpha}$) and the result of [6] (see Theorem G.1).

Corollary 4.5. Fix $0 < \alpha < 1$. There is a randomized sparse signal recovery algorithm that with $m = O(k \log(N/k) + s/\alpha)$ measurements will foil an $s$-information-theoretically bounded adversary; that is, the algorithm's output will meet the $\ell_2/\ell_2$ error guarantees with probability $1 - \alpha$.

The algorithm in [6] does not have a sublinear running time. If the goal is to defeat such an adversary and to do so with a sublinear algorithm, we must adjust our measurements accordingly, using Table 1 as our reference for the range of parameters $p$ and $m$. We note that in [7], there was a single result for $O(\log N)$-information-theoretically bounded adversaries ($O(k \log(N/k))$ measurements are sufficient) and this corollary provides an upper bound for the entire range of parameter $s$.

4.3 Consequences for the $\ell_1/\ell_1$ forall sparse recovery

We observe that our techniques also work for the $\ell_1/\ell_1$ foreach sparse recovery. Actually the fact that an $\ell_2/\ell_2$-foreach sparse recovery system is also an $\ell_1/\ell_1$-foreach sparse recovery system follows easily from known results. So in particular, Lemma 4.4 implies that one can get a similar system in the $\ell_1/\ell_1$ sense



with the same parameters. Furthermore from Remark 1.3 to convert this into a weak (k, (, rf) identification 
system with (0(k/rj) , 0)-guarantee (i.e. a deterministic system), we need to take union bound over (, ,) + N x 
events, where k' = 0((,~ 5 ri~ 2 k) and x = 0(log(N/k)). Thus, for k ^ Q(log(N/k) with the existing 
machinery from 11221 to convert a weak system into a top-level system, we get the following result: 



Theorem 4.6. For any 1 ^ k ^ N (such that k ^ Q(log(N/k)) and e, a > 0, there exists a (k, 1 + e)- 
•£i/£i top level system with 0(e~ 18 klog(N/k) log^f N) measurements and decoding time (k/s) ^ 2 a ' ■ 
log ! 1 ) An^jr/n's scheme uses O e (k ■ log *- 1 ) N) bits of space. 

Proof Sketch. The results in [22] imply that a weak (k, rj/2, -rf) identification system (for any 7/ > 0) with 
(0(k/r]), 0)-guarantee (i.e., it is deterministic) can be converted into a (k, l + e)-{,i/t\ top level system with 
only a constant blowup in the number of measurements (with the same dependence on e as one has on if). 



We note that by Lemma 4.4 and Lemma 4.10|(and the observation above that an ^2/^2 guarantee implies 



an ^i/^i guarantee), we obtain a weak (kX,v) identification system with (0(k/r/), (N/k) n ^ v k ^)- 
guarantee with 0((,~ l2 rj~ % k\og(N / k) log^ N) number of measurements. Thus, by Remark 



1.3 



one can 



convert such a system into a deterministic one, which by the discussion in the paragraph above completes 
the proof. □ 

4.4 Basic Building Blocks

In this section, we lay out some basic building blocks that will help us prove Lemmas 4.3 and 4.4.

4.4.1 Expander Based Sparse Recovery

In this section, we record three results that will be useful to prove our final result. The proofs modify those from [22] and use an expander instead of a random graph. (The actual arguments are similar.) The proofs are deferred to Appendix H. We will prove the following result:

10 The $O(\cdot)$ notation here hides the dependence on $\alpha$.

Theorem 4.7. Let $s \ge 1$ be an integer and $0 < \eta, \gamma < 1$ be reals. Let $G : [N] \times [\ell] \to [M]$ be a
(4k, e)-expander with e = 0(j 3 rj) and £ Js c • log(iV7/c) (/or rome Zarge enough constant c = G(s) ?/jctf 
can depend on e). Let Ai be the random matrix obtained by multiplying each (non-zero) entry in Ai g by 
random (independent) ±1. Then except with probability ( A , Ai is a (k, 27, -Jrf) weak ^2/^2 system. 



The estimation algorithm in Appendix [H] with Corollary H.3 (where we substitute 77 by rj 2 ) implies the 
following: 

Theorem 4.8. Let s ^ 1 be an integer and < 77, 7 < 1 be reals. Let G : [N] x [£] — > [M] be a 
(4k, e)-expander with e = 0(7 3 r/ 2 ) and £ Js c • log(N/k) (for some large enough constant c = @(s) that 
can depend on e). Let Ai be the random matrix obtained by multiplying each (non-zero) entry in Ai g by 
random (independent) ±1. Then except with probability ( . ) , the following is true. 

There exists an algorithm that given as input S C [N] in time 0(\S\ ■ £ + k/rj ■ log(|i?|)) outputs 0(k/rf) 
items I C S that contain all but yk items i € 5 such that \xi\ > 3-\/r) 2 /k\\z\\2- 



Finally, the proof of of Theorem 4.7 also implies the following: 



Theorem 4.9. Let s ^ 1 be an integer and < 77, 7 < 1 be reals. Let G : [N] x [£] — > [M] be a 
(4k, e)-expander with e = 0(7 3 ?? 2 ) and £ Js c • log(iV /k) (for some large enough constant c = Q(s) that 
can depend on e). Let .M be the random matrix obtained by multiplying each (non-zero) entry in Aic by 
random (independent) ±1. Then except with probability ( fc ) , the following is true. 

.M is (k, £, rj) weak £2^2 system. Further, there exists an algorithm that given as input S C [N] (that 
contains all but Qk/2 elements ofH k+k /„(x)) in time 0(\S\ ■ £ + k/rj ■ log(|5|)) outputs x with the required 
properties. 

4.4.2 Weak Identification to Top-Level system conversion 

We begin with the following observation that follows by repeating the given weak identification matrix $s$ times (proof in Appendix G.2):

Lemma 4.10. Let $\mathcal{M}$ be a $(k, \zeta, \eta)$ weak identification matrix with $(\ell, p)$ guarantee. Then there exists a $(k, 3\zeta, \eta)$ weak identification matrix $\mathcal{M}'$ with $(2\ell, p^{\Omega(s)})$ guarantee with $s$ times more measurements.

By combining Lemma 4.10 and Theorem 4.9 we get:



Lemma 4.11. Let Ai be a (k, £, rj) weak identification matrix with (£,p) guarantee with m measurements 
and let G be an expander as in Theorem 4.9 Then for any integer s ^ 1, there exists a (k, 3£, rj) weak ^2/^2 
system with 0(m ■ s + M) measurements with failure probability p n (s) + ( , ) 

Next we present a simple yet crucial modification of the weak level to top level system conversion from [22].

Lemma 4.12. Let the following be true for any < £, 7/ < 1 and integers s,k ^ 1. There is a (k,£,rj) 
weak ^2/^2 system .A/f with 0(s ■ ^~ d rj~ c • k • log(N/k)) measurements^^ failure probability q-^( sk ) and 
decoding time T(k, £, rj, N, s). Then for any e, a > and k ^ 1, there is a (k, e)-top level system with 
failure probability q- n ( k / lo s " ' a fc ) ; 0(e" c ■ k ■ \og(N/k)) measurements and 0(log k ■ T(k, (, rj, 1)) 

decoding time. 



c and d are absolute constants.

Proof Sketch. We only sketch the differences from the corresponding argument in [22]. We will have $\log k$
stages. In stage i, we will pick a (J^, |, e • p^) weak £2/^2 system with s = 2* / '^( 1 +«) c + 2 + Q anc j "stack" 
them to obtain our final top level system. Using the "loop invariant" technique of Gilbert et al., and argu- 
ments similar to I22L one can design a decoding algorithm that has an approximation factor of 



00 1 



-i i 



as desired. (In the above we used the fact that Yli^i i 1 a = 0(1) for any constant a.) 

We briefly argue the claimed bounds on the number of measurements and the failure probability (the rest of 
the bounds are immediate). Note that at stage i, the number of measurements (ignoring the constant in the 
O(-) notation): 

' • 2* • 2 d • e~ c ■ (-±-) ° -A . Iog(JV2*/fc) ^ -1- . 2 d • ,- c • klog(N/k). 



j j (l+a)c+2+a \jl+a J 2 

The claimed bound on the final number of measurements follows from the fact that Y^S=\ tth = 0(1). 
Next note that the failure probability at stage i is q -n(2 l k/(2H( 1+a '> c + 2 + a ) _ q - { iil+ol)c+2+a ) _ The final 
failure probability is determined by the failure probability at i = log k, which implies that the final failure 
probability is q~ n ( k / lo & a c a fc ), as desired □ 

4.4.3 An Intermediate Result 

In this section, we will present an intermediate result that will be the building block of our recursive construction (in Section 4.5). Let $N \ge M \ge 1$ be integers and let $h : [N] \to [M]$ be a random map that we will define shortly. Let $G : [M] \times [\ell] \to [m]$ be a $(4k, \varepsilon)$-expander. In this section, we will consider how good an identification matrix we can obtain from $\mathcal{M}_{G \circ h}$ (with random $\pm 1$ entries).

Call a map h : [N] — > [M] to be (k' , a)-random if the following is true. For any subset S C [N] of size k', 

the probability that for any i S, h(i) = h(j) for any j G S is upper bounded by O ( fji- a ) • Note that a 

(k' + l)-wise independent random map from [N] —> [M] is (k! ', 0)-random. 

We will prove the following "identification" analogue of Theorem 4.7 (the proof appears in Appendix I):

Lemma 4.13. Let < a, rj, 7 < 1 be reals. Let f : [N] —> [M] be a (k(C~ 5 f]~ 2 + 1), a)-random map with 
M > n((- 6 r]- 2 ■ k^* ■ (\og(N/k)) 2 ). Let G : [M] x [£] -)■ [m] be a (4k, e) -expander with e = 0(j 3 7] 2 ) 
and £ JS c • log(M/A;) (for some large enough constant c = @(s) that can depend on e). Let M. be the 
random matrix obtained by multiplying each (non-zero) entry in M.Gof by random (independent) ±1. Then 

except with probability (jr) , the following is true. 

There exists an algorithm that given as input S C [M] outputs in time 0(\S\ ■ £) 0(k/rf) items I C S that 
contain f '(i) for all but jk items i G S such that \x%\ > 3y'r] 2 /k\\z\\2. 

4.5 Proof of Lemma 4.3

See Figure 1 for an overview of the proof of Lemma 4.3.

We note that Theorem 4.8, instantiated with an optimal (random) expander, implies the following:

Corollary 4.14. There exists a family of $(k, \zeta, \eta)$-weak identification matrices $\{\mathcal{M}_n\}_{n \ge 2}$ with
(0(k/rj), {n/k)- n( -^)-guarantee, 0{C & r] 
fication time. 



4 klog(n/k)) measurements and 0{Q 3 r? 1 \S\ Aogn) identi- 



Our main result is that we can recursively construct a weak identification matrix that has efficient identification from the family of weak identification matrices guaranteed by Corollary 4.14 (which by itself does not have a sub-linear identification algorithm). The main idea is as follows: using a tree structure, we map $[N]$ to successively smaller sub-problems. At the "leaves" the domain size is $O(k)$ and thus one can use any identification matrix (including the identity matrix). The main insight is in how to break up the domains at an internal node. For illustration purposes, consider the root. We use a uniform and (good) list recoverable $(r, N)_{\sqrt[b]{N}}$-code $C$ (for some parameter $b > 1$). (At the end we will use one of the codes from Section 2 as $C$.) More precisely, the domain $[N]$ is broken down into $r$ copies of the domain $[\sqrt[b]{N}]$, and $i \in [N]$ gets mapped to $C(i)_j$ in the $j$th child/sub-problem. By induction, we will prove that for each of the $r$ children, the weak identification problem can be solved efficiently. In other words, for each $j \in [r]$, we will have a candidate set $S_j \subseteq [\sqrt[b]{N}]$, each of size $O(k/\eta)$. As $C$ is list recoverable we can recover a list $S$ that contains $i$. However, this list could be bigger than the required $O(k/\eta)$ bound (though not much larger). To prune this down to a list of size $O(k/\eta)$ we use a weak identification matrix $\mathcal{M}$ to narrow down the list to $I$ in running time proportional to $|S|$.

Below we state our formal result. 



be a family ofm{n) x n matrices from 



d = f 0(k/rj),p(n) d = f (n/Jfc)- n ^*) 



Theorem 4.15. LetO < £,ot,rj < 1 be real numbers. Let{M.n} n ^r-i 
Corollary 4.14 that are (k, £, r\)-weak identification matrices with I 

guarantee where m(n) = g(£, rj) ■ k ■ log(n/k) for some function g(£, r/). Finally, for any real numbers 
b > 1, 0^p<l and integer r ^ 1, let {C n } n ^i be a family of codes, where C n is a (r, n) w^ code that is 
(p, £, L)-list recoverable in time T(£, n, b). 

Then for large enough N there exists a m'(N) x N matrix Ai* that is (k, £', r\)-weak identification matrix 
with (0(k/r]),p'(N))-guarantee and identification time complexity D(k, N) as follows. In what follows let 



4.13 



A Js fi(£ b rj • k 1 - 2 " ■ (\og{N/k)) ) satisfy the lower bound in Lemma 

There exists two integers h ^ 0(log r log^ N) and M = i\og A N) such that the following hold: 






V log r J 



p'(N) < M ■ 



A 



ifp>0 
otherwise 

-n(Cfe) 



m'(N) ^ O ( g(C, r])-k- log(N/k) ■ M 1 ^ , 



(6) 

(7) 
(8) 



and 



h-l 



D(k,N) = 0(C 3 7]- 2 -M-AAogA) +J2(r j T(0{k/ri), 'VW, b) + O K~V 2 • L • log Vn 

3=0 



(9) 



We now instantiate Theorem 4.15 with Corollary 4.14 and the list recoverable code from Lemma 2.5 to prove Lemma 4.3.
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4.14 



Corollary 

°(l°gfcolog(iV/fco) 



Proof Sketch of Lemma \4.3\ Let d = 1 + l/a. For the code from Lemma 2.5 we have p = 0, r = d, 
b == d/( d - 1) , £ = 0(fc/»7) and L = 0(A; 1+ % 1+Qi ). We apply Theorem |4. 15] with the family from 

k ■ k$(log(N/k) 2 ). Note that this implies that M = 

and Lemma 



with C = ft and A = e((~ b V 
N). The claimed bounds then follow from the bounds in Theorem 



4.15 



4.16 



We also need to take into account the randomness needed for the random $\pm$ signs. However, it is easy to check that in each of the $M$ nodes, we need $\mathrm{poly}(\zeta, \eta) \cdot k$-wise independence (see Remark 1.2). Storing these random bits can be done within the claimed space usage. □

In the rest of the section, we prove Theorem 4.15.

Proof of Theorem 4.15. The proof essentially sets up the recursive structure (which we will state with the help of a tree) and defines the overall identification algorithm in the natural recursive way. We then verify that all the claimed bounds on the parameters hold. The only non-trivial part is the bound on $\zeta'$ for the case when $\rho > 0$, where we use a careful union bound to obtain the better bound.

Define 

h = [Tog,. log A N] , 

and 

r" 



N 



„h+l _ i 



1 



0(log A N). 



To prove Q, we will prove 

h 

to prove Q, we will prove 



pr+l 
j=0 \ 2 . 




ifp>0 
otherwise 



(10) 



p'(N) < ]T 



r J p 



3=0 



X) < A" •/;(.!) < A'- ( j 



-Cl(Ck) 



(11) 



and to prove ([8]), we will prove 



m'(N) = g((, r])-^2r j -k- log ( b ^N/k) < O (g((, rj) ■ k ■ log(N/k) ■ Af^ j . (12) 

The construction. We will present the recursive construction via a tree. (See Figure 1 for an illustration.) In particular, for ease of exposition, we assume that $N$ is a power of 2. Consider a complete $r$-ary tree with height $h$. We will label the tree as follows: associate any node at level $0 \le j \le h$ with the domain $[\sqrt[b^j]{N}]$. Note that the leaves are associated with the domain $[\Lambda]$ and the tree has $\mathcal{N}$ nodes.

Let $v$ be an arbitrary node at level $j$. Order the $r$ outgoing edges in some (fixed) order and associate the $u$th edge with the $u$th coordinate of the codewords in $C_{\sqrt[b^j]{N}}$. In particular, this defines a function $\phi_w : [N] \to [\sqrt[b^{j+1}]{N}]$ for each node $w$ on the $(j+1)$th level. Let $u_1, \dots, u_j$ be the order of the edges used from the root to $v$ in the tree. Define $\phi_v(i)$ to be the symbol obtained by successively applying the corresponding code in each of the $j$ levels in the path from the root to $v$. In particular,

$$\phi_v(i) = C_{\sqrt[b^{j-1}]{N}}\Big( \cdots C_{\sqrt[b]{N}}\big( C_N(i)_{u_1} \big)_{u_2} \cdots \Big)_{u_j}.$$

For every vertex $v$ at level $j$, we associate with it an $m(\sqrt[b^j]{N}) \times N$ matrix $\mathcal{M}(v)$ as follows. For any $i \in [N]$, the $i$th column in $\mathcal{M}(v)$ is the same as the $\phi_v(i)$'th column in $\mathcal{M}_{\sqrt[b^j]{N}}$.

Let $\mathcal{M}'$ be the matrix that is the stacking of all the matrices $\mathcal{M}(v)$ for every node $v$ in the tree. (The order does not matter as long as one can quickly figure out the rows of $\mathcal{M}(v)$ from $\mathcal{M}'$.) As the final step, we will randomly rearrange the columns of $\mathcal{M}'$. In particular, let $f : [N] \to [N]$ be defined as follows: for every $i \in [N] = \{0, 1\}^{\log N}$, define $f(i)$ to be $i$ with each of the $\log N$ bits (independently) flipped with probability 1/2. Note that this implies that for every $i \in [N]$, $f(i)$ is uniformly distributed in $[N]$.

Finally, define $\mathcal{M}^*$ to be the matrix where the $i$th column (for any $i \in [N]$) is the $f(i)$'th column in $\mathcal{M}'$.
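The final randomization step can be sketched in a few lines of Python. The sketch below draws $f(i)$ by XOR-ing $i$ with a fresh uniformly random bit mask (equivalent to flipping each bit independently with probability 1/2) and then builds the column-rearranged matrix. The toy matrix, the function names `make_f` and `rearrange_columns`, and the use of a seeded pseudo-random generator are assumptions made for this illustration; the sketch ignores the limited-independence considerations discussed elsewhere in the paper.

```python
import random

def make_f(log_n, rng):
    """f(i) = i with each of its log_n bits flipped independently w.p. 1/2."""
    return [i ^ rng.getrandbits(log_n) for i in range(1 << log_n)]

def rearrange_columns(M_prime, f):
    """Column i of the result is column f(i) of M_prime."""
    return [[row[f[i]] for i in range(len(f))] for row in M_prime]

rng = random.Random(1)
log_n = 3
f = make_f(log_n, rng)
M_prime = [[10 * r + c for c in range(1 << log_n)] for r in range(2)]  # toy 2 x 8 matrix
M_star = rearrange_columns(M_prime, f)

# Empirical check that f(i) is (close to) uniform for one fixed i.
counts = [0] * (1 << log_n)
for _ in range(8000):
    counts[5 ^ rng.getrandbits(log_n)] += 1
print(f)
print(counts)
```

Since each $f(i)$ is drawn independently, $f$ is generally not one-to-one, so several columns of $\mathcal{M}^*$ may be copies of the same column of $\mathcal{M}'$.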

The identification algorithm. The algorithm will be described recursively. First we assume that location
$i$ is mapped to $f(i)$ and hence we can now think of $\mathcal{M}^*$ as $\mathcal{M}'$. We will start with a leaf $\ell$ (i.e. a vertex
at level $h$). We claim that the matrix $\mathcal{M}(\ell)$ is of the form required in Lemma 4.13. Indeed, since each of
the codes used in the construction is uniform, $\phi_\ell$ maps elements in $[N]$ uniformly at random to $[A]$. Thus,
given $\mathcal{M}(\ell) \cdot x$, we use the algorithm in Lemma 4.13 (with $S = S_\ell \stackrel{def}{=} [A]$) and "send" $I_\ell$ to its parent node.
Note that $|I_\ell| \le O(k/\eta)$.

Now consider a node $v$ at level $0 \le j < h$. By induction for each of its $r$ children, we will receive subsets
$I_1, \ldots, I_r \subseteq [\sqrt[b^{j+1}]{N}]$. Note that these form a valid input for list recovery of $C_{\sqrt[b^j]{N}}$. We run the list recovery
algorithm on $I_1, \ldots, I_r$ to obtain a subset $S_v \subseteq [\sqrt[b^j]{N}]$. Again using arguments similar to the paragraph
above, we can assume we can run the algorithm from Lemma 4.13 to obtain a subset $I_v \subseteq [\sqrt[b^j]{N}]$ and "send"
it to its parent.

If $w$ is the root of the tree then the final output is $I_w$. There is one catch: the output $I_w$ is the collection of
$f(i)$'s for the appropriate indices $i$. However, the way $f$ is defined it is not a one to one function and thus,
we cannot just apply $f^{-1}$ to the indices in $I_w$. We will return to this issue in Section 4.5.1.



Correctness of the construction. Consider an item $i \in [N]$ such that $|x_i| > 3\sqrt{\eta/k}\,\|z\|_2$. Now consider a
node $v$ at level $j$ in the tree and let $I_1, \ldots, I_r \subseteq [\sqrt[b^{j+1}]{N}]$ be the sets passed up by the $r$ children of $v$ during the
identification process above. Further, assume that for at least $\rho r$ values $u \in [r]$, $C_{\sqrt[b^j]{N}}(\phi_v(i))_u \in I_u$. Then
since $C_{\sqrt[b^j]{N}}$ is $(\rho, O(k/\eta), L)$-list recoverable, we will have $\phi_v(i) \in S_v$. We then run the algorithm
from Lemma 4.13 on $S_v$. Now, two things can happen: either $\phi_v(i) \in I_v$ or not. In the latter case, we will
lose $i$ but we will account for such items $i$ in the next part. In the former case, let $v$ be the $j'$th child of vertex
$u$. Then note that $\phi_v(i) = C_{\sqrt[b^{j-1}]{N}}(\phi_u(i))_{j'}$ and thus, we can apply the argument inductively to $u$. In other
words, if $i$ is never lost then we will have $i$ in the final output $I_w$, assuming we can prove the base case. For
the base case consider any leaf $\ell$. Since we picked $S_\ell = [A]$ (and hence, trivially $\phi_\ell(i) \in S_\ell$), if the algorithm
from Lemma 4.13 does not lose $i$, then $\phi_\ell(i) \in I_\ell$ as desired.



Analyzing the number of lost heavy hitters. If $\rho = 0$, then it is easy to see that the identification
algorithm above can lose at most $\mathcal{N} \cdot (\zeta k)$ of the "heavy enough" items. Thus, in this case the bound of
$\zeta' \le \mathcal{N} \cdot \zeta$ is obvious.

Next, we consider the case when $\rho > 0$. Define $e = \rho r + 1$. We will argue soon that for every heavy item $i$ that
is not in $I_w$, there exists a level $j$ such that for at least $(e/2)^j$ nodes $v$ at that level, $\phi_v(i)$
is lost by the algorithm from Lemma 4.13 when it is run on $S_v$. In such a case we will assign item $i$ to level
$j$. Note that in the previous naive argument, we count each $i$ at least $(e/2)^j$ times at level $j$. In total, level
$j$ can lose at most $r^j \cdot \zeta \cdot k$ items, which implies that at most $\left(\frac{r}{e/2}\right)^j \cdot \zeta \cdot k$ distinct items can be lost
at level $j$. Summing over all the levels and noting that the total number of lost items is $\zeta' \cdot k$, we obtain the
first inequality in (10). It is easy to verify that $\sum_{j=0}^{h} (r/(e/2))^j \le (2r/e)^{O(h)}$, from which the equality in
(10) follows by substituting the values of $h$ and $e$.



Finally, we argue that every lost item $i \notin I_w$ is associated with a level $j$, where at least $(e/2)^j$ invocations
of the algorithm from Lemma 4.13 lose $i$. Towards this end, call a vertex $v$ a bad internal node for $i$ if the
invocation of the algorithm from Lemma 4.13 does not lose $\phi_v(i)$ but $\phi_v(i) \notin I_v$. Call the vertex $v$ a bad leaf
for $i$ if the invocation of the algorithm from Lemma 4.13 itself loses $\phi_v(i)$. Note that if a node $v$ is a bad
internal node for $i$ then at least $e$ of its children are either a bad internal node or a bad leaf for $i$. In other
words, the set of bad internal nodes and bad leaves for $i$ form a tree $T_i$ with degree at least $e$ such that the leaves
of $T_i$ are exactly the bad leaves for $i$. Finally, note that we want to show that for every lost $i$, there is a level
$j$ with at least $(e/2)^j$ leaves of $T_i$ in level $j$. This is now a simple question on trees that we prove by a simple
greedy argument.

W.l.o.g. assume that every internal node in $T_i$ has degree exactly $e$. If $T_i$ has one node then we are done. If
not, consider the $e$ children of the root. If at least $e/2$ of these are leaves then we are done. If not, the root
has at least $e/2$ internal nodes as children. Thus, if we have not stopped for $j$ levels, we will have at least $(e/2)^j \cdot e$
children at level $j + 1$. If at least half of them are leaves then we have at least $(e/2)^{j+1}$ leaves in level $j + 1$;
otherwise we have at least $(e/2)^{j+1}$ internal nodes. We can continue this argument in the worst case till
$j = h$, but all nodes at level $h$ have to be leaves, which means that the process will terminate with at least
$e(e/2)^{h-1}$ leaves.



Analyzing the failure probability. The first inequality in (11) follows from the union bound (recall that
there are $r^j$ nodes at level $j$). The second inequality follows from the fact that $p(n)$ is decreasing in $n$ and
the final inequality follows from the definition of $p(\cdot)$.



Analyzing the number of measurements. Note that $m'(N) = \sum_{j=0}^{h} r^j \cdot m(\sqrt[b^j]{N})$, which proves the
equality in (12). The inequality follows by bounding the sum $\sum_{j=0}^{h} (r/b)^j$ as we did earlier.



Analyzing the identification time. Finally, in the identification algorithm for each node $v$ (at level $j < h$),
we run two algorithms: one is the list recovery algorithm (which takes time $T(O(k/\eta), \sqrt[b^j]{N}, b)$) and the
other is the algorithm from Lemma 4.13, which by Corollary 4.14 takes time $O(\zeta^{-3}\eta^{-2}|S_v| \cdot \log \sqrt[b^j]{N})$.
The sum in (9) comes from noting that $|S_v| \le L$. Finally, at level $j = h$, we pick $S_v = [A]$ and hence
Lemma 4.13 and the fact that there are at most $\mathcal{N}$ nodes at level $h$ justify the first term in (9). □



4.5.1 Inverting the indices 



We are done with the proof of Theorem 4.15 except for taking care of the issue that the random function
$f : [N] \to [N]$ chosen earlier might not be one to one.

We now mention two ways that we can modify the proof above to take care of this. The difference is in the
final identification time we can achieve.

For simplicity we will assume that $N$ is a power of 2. Thus it makes sense to talk about fields $\mathbb{F}_N$ and
$\mathbb{F}_{N^2}$. We will also go back and forth between the field representation and $[N]$ and $[N^2]$ respectively. Let

$$d = k(\zeta^{-5}\eta^{-2} + 1).$$

With $\Omega(N)$ storage one can solve this problem for any code with the required properties. See the appendix
for the details.

Next, we present a specific but cheap solution. For this part, we will make a strong assumption on our list
recoverable codes $C$: we will assume that we are looking at the Loomis-Whitney code (from Lemma 2.5)
or Reed-Solomon codes (from Lemma 2.6).

We pick g : [N] — > [N 2 ] randomly by picking a random degree d polynomial over Fjy and define g(i) to 
be the evaluation of this polynomial at i. It is well-known that such a random hash function is (d + l)-wise 
independent (see e.g. ifTTTD . Finally define f(i) = (i, g(i)). 
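
The map $f(i) = (i, g(i))$ can be illustrated with a minimal sketch. Two assumptions that are not in the text: the sketch works over a prime field $\mathbb{F}_p$ rather than the power-of-2 fields used above, and the names `random_poly`, `eval_poly` and `make_f` are hypothetical.

```python
import random

def random_poly(d, p, rng):
    """Coefficients (lowest degree first) of a uniformly random
    polynomial of degree at most d over the prime field F_p (assumed field)."""
    return [rng.randrange(p) for _ in range(d + 1)]

def eval_poly(coeffs, x, p):
    """Evaluate the polynomial at x modulo p using Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def make_f(d, p, seed=0):
    """Return f(i) = (i, g(i)) where g is a random polynomial hash."""
    rng = random.Random(seed)
    coeffs = random_poly(d, p, rng)

    def f(i):
        # The first component keeps i itself, so f is one to one and
        # i can be read back from f(i) in constant time.
        return (i, eval_poly(coeffs, i, p))

    return f, coeffs
```

Because the first component of $f(i)$ is $i$ itself, injectivity holds for any choice of $g$; the randomness of $g$ only matters for the collision analysis.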

We have to tackle three issues. 

First, we now have to compute the matrix $\mathcal{M}'$ with $N^2$ columns (then we pick the columns of $\mathcal{M}'$
indexed by $(i, g(i))$ for $i \in [N]$) instead of $N$ columns as we have used so far. This, however, is not an issue
since we have a family of matrices $\mathcal{M}_n$. This change results in some constants in the parameters of $\mathcal{M}^*$
changing but it does not affect the asymptotics.

Second, note that every $f(i)$ is unique and recovering $i$ from $f(i)$ can be done in $O(1)$ time. Thus, we will
not need any additional space and will add an additive factor of $O(k/\eta)$ to the identification time.



Finally, we have to show that this new $f$ is fine with respect to applying Lemma 4.13. In particular, we
have to show the following: consider any node $v$ in the recursion tree. Then we will show that the (new)
$\phi_v : [N] \to [M]$ is $(d, a)$-random (for some small $a$). By our assumption on $C$, the
$\log M$ bits of $\phi_v(i)$ for any $i$ are from the $2 \log N$ bits of $(i, g(i))$. Unfortunately, for small $M$,
these $\log M$ bits might all be from the deterministic part (i.e. these are some $\log M$ bits from the $\log N$ bit
representation of $i$), which implies that there will be a lot of collisions (in fact $N/M$ indices might collide).

However, the main insight is that when we divide the domain at a node v into its r children, we have full 
freedom in how we divide up the bits. (Recall that the code C just repeats some subset of message bits as a 
codeword symbol.) The bad case mentioned above is only true if we do not do this division carefully. Thus, 
the idea is to do this division carefully so that the "g(i)" part of f(i), which has full randomness, is passed 
along. 

To fully implement this, we will need to change g a bit. In particular, pick t = |~^] . Then pick g to be 
random degree d polynomial over ¥ N t-i. Now consider a code in the recursive tree structure. W.l.o.g., let 
us assume that C is an (r, N l ) 6^ code. Let j G [r], then we want to argue that C((i, g(i)))j is pretty much 
random. 

We first consider the case that $C$ is the Loomis-Whitney code. First think of the message as coming from
$[N] \times [N^{t-1}]$ (recall that $f(i) = (i, g(i))$ and hence is from this domain). Consider a message $(i, \ell)$. We
partition it up into $d$ symbols from $[\sqrt[d]{N}] \times [\sqrt[d]{N^{t-1}}]$, i.e. each symbol has the same "proportion" of
randomness. In other words, a symbol $(i', \ell')$ in the codeword for $(i, \ell)$ will have $i'$ to be deterministic and
the $\ell'$ part to be completely random.

Next we consider the case when $C$ is the $(r, N)_{\sqrt[b]{N}}$ Reed-Solomon code. First think of the message as
coming from $[N] \times [N^{t-1}]$ (recall that $f(i) = (i, g(i))$ and hence is from this domain). Consider a message
$(i, \ell)$. We partition it up into $b$ symbols $u_0, \ldots, u_{b-1} \in [\sqrt[b]{N}] \times [(\sqrt[b]{N})^{t-1}]$, i.e. each symbol has the same
"proportion" of randomness. Let us associate $[\sqrt[b]{N}]$ with the field $\mathbb{F}_q$, where $q = \sqrt[b]{N}$. Recall that we need
to choose $r$ evaluation points for the code $C$. Since we have the full freedom in picking these elements, pick
distinct $\beta_1, \ldots, \beta_r \in \mathbb{F}_q$. (Recall that $r$ is a constant, so this is always possible.) Now recall that $C(i, \ell)_j$
will be the element

$$\sum_{s=0}^{b-1} u_s \cdot \beta_j^s.$$

Let us think of $u_s$ as an element of $\mathbb{F}_{q^t}$ in the standard notation, i.e. $u_s = \sum_{a=0}^{t-1} u_s^{(a)} \gamma^a$, where $\gamma$ is a root of
an irreducible polynomial of degree $t$ over $\mathbb{F}_q$ (and the one that defines the standard representation of $\mathbb{F}_{q^t}$).
Then the above relation can be written as

$$C(i, \ell)_j = \sum_{s=0}^{b-1} \beta_j^s \cdot \left( \sum_{a=0}^{t-1} u_s^{(a)} \gamma^a \right) = \sum_{a=0}^{t-1} \gamma^a \cdot \left( \sum_{s=0}^{b-1} \beta_j^s \cdot u_s^{(a)} \right).$$

Since $\ell = g(i)$ was completely random, the above implies that if one thinks of $C(i, \ell)_j$ as an element of
$\mathbb{F}_q \times \mathbb{F}_{q^{t-1}}$, then the part in $\mathbb{F}_{q^{t-1}}$ will be completely random.

The above implies that when $C$ is the Loomis-Whitney or Reed-Solomon code, any $\phi_v(i)$ has
the last $(1 - a) \log M$ of its bits uniformly random. We will now show that $\phi_v$ is $(d, a)$-random, which
will be sufficient to apply Lemma 4.13. Towards that end, let us fix a subset $S \subseteq [N]$ of size $d$. Consider
$i \in [N] \setminus S$. Consider the values $\{\phi_v(j)\}_{j \in S}$. Note that we do not have any control over the first $a \log M$
bits in these values, so in the worst case let us assume that they (as well as $\phi_v(i)$) have the same $a \log M$
bit-vector as their prefix. However, since $g$ is $(d+1)$-wise independent, we have that the $(1 - a) \log M$ bit suffixes
of $\{\phi_v(j)\}_{j \in S \cup \{i\}}$ are all uniformly random and independent. Thus, even conditioned on the values of $\phi_v(j)$
for every $j \in S$, the probability that $\phi_v(i) = \phi_v(j)$ for some $j \in S$ is upper bounded by

$$\frac{d}{2^{(1-a) \log M}} = \frac{d}{M^{1-a}},$$

as desired.

We will call the scheme above SCHEME-2. Note that we have shown that SCHEME-2 works and that the following holds.

Lemma 4.16. SCHEME-2 adds 0(k/rj) to the decoding time in Theorem 4.15 and needs 0(£ rj~ 2 • k ■ 



log N) bits of space. 

The claim on the space comes from the amount of randomness needed to define g (which we need to store). 

4.6 Proof of Lemma 4.4

In this section, we do a better analysis of the earlier recursive construction (from the proof of Lemma 4.3)
assuming the list recoverable code also leads to a limited-wise independent source, which leads us to the
proof of Lemma 4.4. For our purposes, this code will be the RS code.



We begin with the following "inductive step" of the recursive construction. 



Lemma 4.17. Let $\mathcal{M}$ be an $m \times \sqrt[b]{n}$ matrix from Corollary 4.14 that is a $(k, \zeta, \eta)$-weak identification matrix
with $(\ell, p)$ guarantee and identification time $T_1$. Let $C$ be an $(r, n)_{\sqrt[b]{n}}$ code that is $(\rho, \ell, L)$-list recoverable
in time $T_2$. Assume that the following holds:

P<{0\ (13) 

Then there exists an $(rm) \times n$ matrix $\mathcal{M}^*$ that is a $(k, \zeta', \eta)$-weak identification matrix with $(L, p')$ guarantee
and identification time $r \cdot T_1 + T_2$ with

$$\zeta' \le \frac{2\zeta}{\rho} \qquad (14)$$

and

$$p' \le p^{\rho r/4}, \qquad (15)$$

where the codewords in $C$ form an $a$-wise independent source for some $a \ge \rho r/2$.

Proof. The basic idea is to first encode each index $i \in [n]$ with $C$ and then use a copy of $\mathcal{M}$ for each of
the $r$ positions in $C(i)$. For identification, we first run the identification algorithm for $\mathcal{M}$ on each of the $r$
positions and then combine the results using the list recovery algorithm of $C$. The main insight in the analysis
of the algorithm is that the $\zeta$ fractions of the heavy hitters dropped in each of the $r$ positions are $a$-wise independent.
Thus, by Chernoff the probability that a heavy hitter is dropped more than $\rho r$ times is exponentially small.
Next we present the details.

The construction. For $j \in [r]$, let $\mathcal{M}_j$ denote the $m \times n$ matrix obtained as follows. For every $i \in [n]$,
the $i$'th column in $\mathcal{M}_j$ is the $C(i)_j$'th column in $\mathcal{M}$. Let $\mathcal{M}'$ be the stacking of the matrices $\mathcal{M}_1, \ldots, \mathcal{M}_r$.
Let $f : [n] \to [n]$ be a completely random map. Then the final matrix $\mathcal{M}^*$ is obtained by putting as its $i$'th
column the $f(i)$'th column of $\mathcal{M}'$.

It is easy to verify that $\mathcal{M}^*$ has $rm$ rows, as desired.

The identification algorithm. We will first assume that the index $i$ has been mapped to $f(i)$. (Ultimately,
we will need to do the inversion. We address this part separately in Section 4.5.1.) Thus, we can assume that
we are working with $\mathcal{M}'$. Now if we consider the $\mathcal{M}_j$ part of $\mathcal{M}'$ (for any $j \in [r]$) and run the identification
algorithm for $\mathcal{M}_j$ (from Corollary 4.14), then we will get a list $S_j$ of the values $C(h)_j$ for all but $\zeta k$ heavy
hitters $h \in [n]$. We then run the list recovery algorithm for $C$ on $S_1, \ldots, S_r$ to obtain our final list of
candidate heavy hitters. Note that this algorithm takes time $r \cdot T_1 + T_2$ as desired. Next, we analyze the
correctness of the algorithm.



Correctness of the algorithm. Consider a heavy hitter $h \in [n]$. Note that it will not be output iff $C(h)_j \notin S_j$
for $> \rho r$ positions $j \in [r]$. Let $\zeta'$ be the fraction of heavy hitters we miss due to the above condition. Next
we bound $\zeta'$ and the failure probability $p'$.

Call a codeword position $j \in [r]$ bad if the identification algorithm for $\mathcal{M}$ loses more than $\zeta k$ heavy
hitters when we run the algorithm for $\mathcal{M}_j$. By Lemma C.2 the probability of having $> \rho r/2$ bad positions
is upper bounded by

$$\left(\frac{e p r}{\rho r/2}\right)^{\rho r/2},$$

which by (13) is at most the bound on $p'$ in (15). Here we use the fact that any $a \ge \rho r/2$ positions in a
random codeword in $C$ are independent. We will show next that conditioned on there being at most $\rho r/2$ bad
positions, at most $\zeta' k$ heavy hitters get lost, where $\zeta'$ satisfies (14). Note that this implies that the overall
failure probability $p'$ is upper bounded by the above probability, which then proves (15).

To bound $\zeta'$, call a heavy hitter $h \in [n]$ corrupted if $C(h)_j \notin S_j$ for at least $\rho r/2$ of the good positions
$j$. Note that by the fact that $C$ is $(\rho, \ell, L)$-list recoverable, if $h$ is not corrupted then it is going to be output by
our identification algorithm. Thus, we overestimate $\zeta'$ by upper bounding the number of corrupted heavy
hitters. A simple counting argument shows that the number of such corrupted heavy hitters cannot be more than
$2\zeta k/\rho$, which proves (14). □



Recall that an $(r = b/(1 - 2\rho) + 1, n)_{\sqrt[b]{n}}$ RS code by Lemma D.4 is $(\rho, \ell, \ell^r)$-list recoverable. Further,
recall that such a code is $b$-wise independent. If $\rho < 2/5$, then $b > \rho r/2$, so we have
sufficient independence to use this code as $C$ in Lemma 4.17.

Using the RS code above with $\rho = 1/4$ in Lemma 4.17 one can prove Lemma 4.4.

Proof Sketch of Lemma 4.4. The proof is very similar to that of Theorem 4.15 and here we just sketch
how the current proof differs from the earlier one. As in Theorem 4.15 we start off with $\{\mathcal{M}_n\}_n$,
a family of $m(n) \times n$ matrices from Corollary 4.14 that are $(k, \zeta, \eta)$-weak identification matrices with
$(\ell \stackrel{def}{=} O(k/\eta),\ p(n) \stackrel{def}{=} (n/k)^{-\Omega(\zeta k)})$-guarantee, where $m(n) = g(\zeta, \eta) \cdot k \cdot \log(n/k)$ for some function $g$.

However, unlike Theorem 4.15 we will pick a specific family of codes. In particular, let $\{C_n\}_{n \ge 1}$ be the
$(r \stackrel{def}{=} 2b+1, n)_{\sqrt[b]{n}}$ Reed-Solomon code: we will later pick $b$ to be $O(2^{1/\varepsilon})$. Note that Lemma 2.6 implies
that this code will be $(\rho = 1/4, \ell, \ell^{2b+1})$-list recoverable in time $O(\ell^{2b+1} b^2 \log^2 n)$.

Other than the specific family of codes, the rest of the construction is exactly the same as in the proof of
Theorem 4.15 to obtain for large enough $N$ an $m'(N) \times N$ matrix $\mathcal{M}^*$ that is a $(k, \zeta', \eta)$-weak identification
matrix with $(O(k/\eta), p'(N))$-guarantee and identification time complexity $D(k, N)$ as follows. (The
analysis is pretty much the same as in the proof of Theorem 4.15, except that in the analysis of $\zeta'$ and $p'$ in the
"recursive" step, instead of using the arguments in the proof of Theorem 4.15 we use Lemma 4.17.)

In what follows let A ^ Sl(£~ 6 ?]~ 2 • A; 1 - 2 " • (log(iV/£;)) 2 ) satisfy the lower bound in Lemma 
exists two integers h ^C 0(log r log^ N) and J\f = ilog A N) such that the following hold: 



4.13 



There 



c'^c- 



p'(N) < 



A 



-n((-k-(p/4:) h -r h ) 



i°g£ 



m'(N) < O g((, rj)-k- log(N/k) ■ Af^ , 



(16) 

(17) 
(18) 



and 



h-l 



D(k, N) = (C~V 2 ■ N ■ A ■ log A) + J^ {( r / b ) j 0{(k/ri) r log 2 N) + (C~V 2 -L-log "Vn 

3=0 

(19) 
Noting that r h = log^ N and that (0(l)/p) h = _\fO(iog(i/ P )/io g r) along wkh the choice oi A = £-6^-2 . 
k r ■ log 2 (N/k) and some simple manipulation implies the following parameters: 



rO 



'log(l/p) N 



C' ^ C.-M v logr 

f \ log r J n 



p'{N) < 



N ^ -a({k/N 



log- 



m'(N) ^ O g((, n)-k- log(N/k) ■ M 1 ^ , 



and 



D(k, N) = (C n -Af-A-logA)+ C ~\ l~ {k/vf ■ poly(logiV). 

'log(l/p) s 



(20) 

(21) 
(22) 

(23) 



To obtain the claimed parameter p' one has to apply Lemma 



4.10 



rO 



with s = M v logr ' /(. Finally, to 



rO 



'log(l/p) N 



obtain the claimed value of ( one has to re-scale ( by a factor of 1/JV ^ logr K (To verify the dependence 
of these parameters on $\varepsilon$, recall that we picked $\rho = 1/4$, $b = O(2^{1/\varepsilon})$ and $r = 2b + 1$.) □

A Threshold for failure probability

We will prove the following result: 

Lemma A.1. If there exists a $(k, C)$-top level system with $m$ measurements and failure probability $p$ then
for every integer $s \ge 1$, there exists a $(k, \sqrt{3} \cdot C)$-top level system with $m \cdot s$ measurements and $p^{\Omega(s)}$ failure
probability.

The above follows easily from the following result:

Lemma A.2 ([19], Lemma 5.3). Let $y_1, \ldots, y_s \in \mathbb{R}^n$ be independent random vectors such that $\Pr[\|y_i\|_2 \le D] \ge 1 - p$
for every $i \in [s]$. Let $y \in \mathbb{R}^n$ be obtained by taking the component-wise median of $\{y_1, \ldots, y_s\}$.
Then

$$\Pr[\|y\|_2 \le \sqrt{3} \cdot D] \ge 1 - (4ep)^{s/4}.$$

In the rest of the section we prove Lemma A.1 using Lemma A.2 in the obvious way. Let $\mathcal{M}, \mathcal{A}$ be the given
top level system. The new top level system $\mathcal{M}', \mathcal{A}'$ is defined as follows. $\mathcal{M}'$ is just $s$ independent copies of
$\mathcal{M}$ (call the copies $\mathcal{M}_1, \ldots, \mathcal{M}_s$). The new decoding algorithm is defined as follows: let $z_1, \ldots, z_s$ be the
output of $\mathcal{A}$ on the $s$ copies of $\mathcal{M}$, i.e. $z_i = \mathcal{A}(\mathcal{M}_i x)$. Then $\mathcal{A}'(\mathcal{M}' x)$ is defined to be the component-wise
median of $z_1, \ldots, z_s$. Lemma A.1 follows by applying Lemma A.2 with $y_j = x - z_j$ and $D = C \cdot \|x - x_k\|_2$,
where $x$ is the original signal.
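
The amplification step can be sketched in a few lines. This is only an illustration under assumptions: `decode` stands in for the base decoder and `sketches` for the $s$ measurement vectors, and both names are hypothetical.

```python
from statistics import median

def componentwise_median(vectors):
    """Component-wise median of a list of equal-length vectors."""
    return [median(column) for column in zip(*vectors)]

def amplified_decode(decode, sketches):
    """Run the base decoder on each of the s sketches and combine
    the s outputs by taking the median in every coordinate."""
    return componentwise_median([decode(y) for y in sketches])
```

A single wildly wrong output among the $s$ runs does not move the median of a coordinate, which is the intuition behind the failure probability dropping exponentially in $s$.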

B Expander Basics 

We will need the following three simple lemmas. 

Lemma B.1. Let $G : [N] \times [\ell] \to [M]$ be a $(2t, \varepsilon)$-expander. Let $S, T \subseteq [N]$ be such that $|T| = t$, $|S| \ge t$
and $S \cap T = \emptyset$. Then

$$|\Gamma(S) \cap \Gamma(T)| \le |E(\Gamma(S) \cap \Gamma(T))| \le 4\varepsilon|S|\ell.$$

Proof. The first inequality is obvious so we prove the second inequality. Divide $S$ into $b = \lceil |S|/t \rceil$ many
disjoint subsets $S_1, \ldots, S_b$ where all $S_i$'s but possibly $S_b$ have size exactly $t$. (For what follows, by adding
extra elements to $S$ if necessary, we will assume that $|S_b| = t$.) Note that $b \le 2|S|/t$ as $|S| \ge t$.

Fix an $i \in [b]$ and note that as $G$ is a $(2t, \varepsilon)$-expander, we have

$$|\Gamma(T \cup S_i)| \ge 2t\ell(1 - \varepsilon).$$

As $|\Gamma(S_i)| \le t\ell$, the above implies that

$$|\Gamma(T) \setminus \Gamma(S_i)| \ge t\ell(1 - 2\varepsilon).$$

The above in turn implies that

$$|E(\Gamma(T) \setminus \Gamma(S_i))| \ge |\Gamma(T) \setminus \Gamma(S_i)| \ge t\ell(1 - 2\varepsilon).$$

Since $|E(T)| = t\ell$, the above implies that

$$|E(\Gamma(T) \cap \Gamma(S_i))| \le t\ell - t\ell(1 - 2\varepsilon) = 2\varepsilon t\ell.$$

Finally, as $S$ is the disjoint union of $S_1, \ldots, S_b$, we have

$$|E(\Gamma(T) \cap \Gamma(S))| \le 2b\varepsilon t\ell \le 4\varepsilon|S|\ell,$$

as desired, where in the last inequality we have used the fact that $bt \le 2|S|$. □

Lemma B.2. Let $a \ge 2$ be an integer and let $G : [N] \times [\ell] \to [M]$ be a $(t, \varepsilon < 1/(2a))$-expander. Let
$R \subseteq [M]$ be any subset of size $|R| \le \gamma t\ell$ for some $\gamma > 0$. Then there are $< 2a\gamma t$ elements $i \in [N]$ such
that $|\Gamma(i) \cap R| \ge \ell/a$.

Proof. For the sake of contradiction, assume that there exists a subset $L \subseteq [N]$ with $|L| = 2a\gamma t$ such that
for every $i \in L$, $|\Gamma(i) \cap R| \ge \ell/a$. Consider the neighbors $\Gamma(L) \setminus R$. Note that by assumption each $i \in L$
has at most $(1 - 1/a) \cdot \ell$ such neighbors. Thus, we have

$$|\Gamma(L)| \le 2a(1 - 1/a)\gamma t\ell + |R| \le 2a(1 - 1/a)\gamma t\ell + \gamma t\ell = \left(1 - \frac{1}{2a}\right) \cdot |L| \cdot \ell < (1 - \varepsilon) \cdot |L| \cdot \ell,$$

which contradicts the assumption that $G$ is a $(t, \varepsilon < 1/(2a))$-expander. □

The following is a by-product of the well-known "unique neighborhood" argument (it also follows from the
proof of Lemma B.1):

Lemma B.3. Let $G : [N] \times [\ell] \to [M]$ be a $(t, \varepsilon)$-expander. Then for any subset $S \subseteq [N]$ with $|S| \le t$,
there are at most $2\varepsilon|S|\ell$ vertices $j \in [M]$ each of which has more than one neighbor in $S$. (A vertex in $[M]$
adjacent to exactly one vertex in $S$ is called a "unique neighbor" of $S$.)



C Probability Basics 

Given any two vectors $x, y \in \mathbb{R}^n$, we will denote their dot-product by $\langle x, y \rangle = \sum_{i=1}^{n} x_i \cdot y_i$.
We will use the following well-known result:

Lemma C.1. Let $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ and let $r = (r_1, \ldots, r_n) \in \{-1, 1\}^n$ be a pair-wise independent
random vector. Then

$$\Pr[|\langle x, r \rangle| > a \cdot \|x\|_2] \le \frac{1}{a^2}.$$

We will also use the following form of the Chernoff bound: 

Lemma C.2. Let $X_1, \ldots, X_n$ be iid binary random variables where $p = \Pr[X_i = 1]$. Then

$$\Pr\left[\sum_{i=1}^{n} X_i > a\right] \le \left(\frac{epn}{a}\right)^{a}$$

for any $a > pn$. The above bound also holds if $X_1, \ldots, X_n$ are $a$-wise independent. Further, if the random
variables are $k$-wise independent for some $k < a$, then the upper bound is $(epn/a)^{k}$.$^{12}$

D Coding Basics 

In this section, we define and instantiate some (families) of codes that we will be interested in. We begin 
with some basic coding definitions. 

We will call a code $C : [N] \to [q]^r$ an $(r, N)_q$-code.$^{13}$ Vectors in the range of $C$ are called its codewords.
Sometimes we will think of $C \subseteq [q]^r$, defined in the natural way.

We will primarily be interested in list recoverable codes. In particular, 

Definition D.1. Let $N, q, r, \ell, L \ge 1$ be integers and $0 \le \rho \le 1$ be a real number. Then an $(r, N)_q$ code $C$
is called $(\rho, \ell, L)$-list recoverable if the following holds. Given any collection of subsets $S_1, \ldots, S_r \subseteq [q]$
such that $|S_i| \le \ell$ for every $i \in [r]$, there exist at most $L$ codewords $(c_1, \ldots, c_r) \in C$ such that $c_i \in S_i$ for
at least $(1 - \rho)r$ indices $i \in [r]$. Further, we will call such a code recoverable in time $T(\ell, N, q)$ if all such
codewords can be computed within this time upper bound.

12 The statement of the Chernoff bound is pretty standard. The claim on the independence follows e.g. from Lemma 3 and
Theorem 2 in [25].

13 We depart from the standard convention and use the size of the code $N$ instead of its dimension $\log N$: this makes expressions
simpler later on.

Further, we will call an $(r, N)_q$ code uniform if for every $i \in [r]$ it is the case that $C(x)_i$ is uniformly
distributed over $[q]$ for uniformly random $x \in [N]$. In our construction we will require codes that are both
list recoverable and uniform. Neither of these concepts is new but our construction needs us to focus on a
parameter regime that is generally not the object of study in coding theory. In particular, as in coding theory,
we focus on the case where $N$ is increasing. Also we focus on the case when $q$ grows with $N$, which is
also a well studied regime. However, we consider the case when $r$ is fixed. In particular, our "ideal" code
should have the following properties:

1. $q$ should be as small as possible.

2. $r$ should be as small as possible.

3. $L$ should be as close to $\ell$ as possible.

4. $\rho$ should be as large as possible.

5. $T(\ell, N, q)$ should have poly-logarithmic dependence on both $N$ and $q$ (and at the same time have as
close to linear dependence on $\ell$ as possible).

Next, we present three codes that we will use in our constructions (that achieve some of the properties above 
but not all of the above). 

Split code. In this case we assume that $N$ is both a power of 2 and a perfect square. Given an $x \in [N]$,
think of it as a bit vector of length $\log N$. Let $x_1$ ($x_2$ resp.) be the first $\log N/2$ (last $\log N/2$ resp.) bits
of $x$. Then define $C_{split}(x) = (x_1, x_2)$. This is of course a trivial code and we record its properties below
for future use. The list recovery algorithm for this code is very simple: given $S_1$ and $S_2$ as the input, output
$S_1 \times S_2$.

Lemma D.2. The code $C_{split}$ is a uniform code that is $(0, \ell, \ell^2)$-list recoverable. Further, it is
recoverable in $O(\ell^2 \log N)$ time.
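
The split code and its list recovery can be sketched as follows. This is a minimal illustration, assuming that the "first" bits are the most significant ones and that messages are integers in $\{0, \ldots, N-1\}$; the function names are hypothetical.

```python
from itertools import product

def split_encode(x, log_n):
    """C_split(x) = (x_1, x_2): the top and bottom log_n/2 bits of x."""
    half = log_n // 2
    return (x >> half, x & ((1 << half) - 1))

def split_list_recover(s1, s2, log_n):
    """Output S_1 x S_2, with each pair glued back into a message."""
    half = log_n // 2
    return sorted((a << half) | b for a, b in product(s1, s2))
```

With $|S_1|, |S_2| \le \ell$ the output has at most $\ell^2$ messages, matching the list size in Lemma D.2.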

Loomis-Whitney code. We now consider a code based on the well-known Loomis-Whitney inequality.
Let $d \ge 2$ be an integer and assume that $N$ is a power of 2 and $\sqrt[d]{N}$ is an integer. Given $x \in [N]$ think of it
as $(x_1, \ldots, x_d) \in [\sqrt[d]{N}]^d$. Further, for any $i \in [d]$ define $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d) \in [\sqrt[d]{N}]^{d-1}$.
Then define $C_{LW(d)}(x) = (x_{-1}, \ldots, x_{-d})$. Note that for $d = 2$, we get $C_{split}$. The Loomis-Whitney
inequality shows that $C_{LW(d)}$ is a $(0, \ell, \ell^{d/(d-1)})$-list recoverable code. We also show how to algorithmically
achieve this bound in time $O(\ell^{d/(d-1)})$. This implies the following:

Lemma D.3. The code $C_{LW(d)}$ is a uniform code that is $(0, \ell, \ell^{d/(d-1)})$-list recoverable. Further, it is
recoverable in $O(\ell^{d/(d-1)} \log N)$ time.

The above follows from Theorem E.1.
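
To make the encoding concrete, here is a small sketch. It is not the algorithm of Theorem E.1: the recovery routine below is a brute-force search over all of $\Sigma^d$, used only to illustrate the definition, and the function names are hypothetical.

```python
from itertools import product

def lw_encode(x, d):
    """Codeword of the d-tuple x: the i-th symbol is x with coordinate i dropped."""
    return tuple(x[:i] + x[i + 1:] for i in range(d))

def lw_list_recover_naive(sets, alphabet, d):
    """Brute force: keep every x whose i-th projection lies in sets[i] for all i.
    Runs in time |alphabet|^d, unlike the join-based algorithm of Appendix E."""
    return [
        x
        for x in product(alphabet, repeat=d)
        if all(x[:i] + x[i + 1:] in sets[i] for i in range(d))
    ]
```

For instance, with $d = 3$ and input lists of size $\ell = 2$, the Loomis-Whitney bound allows at most $2^{3/2} < 3$ recovered messages.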

Reed-Solomon code. Finally, we consider the well-known Reed-Solomon (RS) codes. In particular, let
$b \ge 1$ be an integer and let $q = \sqrt[b]{N}$ be a prime power and consider $C_{RS} : [N] \to [\sqrt[b]{N}]^r$. There are
known results on list recovery of Reed-Solomon codes but they need $b/r = O(1/\ell)$, which is too weak for
our purposes. Instead we first note that we can always first output $S_1 \times S_2 \times \cdots \times S_r$ and then for each
vector check whether it is within Hamming distance of $\rho r$ of some RS codeword. E.g., one can use the
well-known Berlekamp Massey algorithm that does unique decoding for $\rho < \frac{1}{2}(1 - b/r)$.$^{14}$ Further, it
is well-known that RS codes are linear codes and it is well-known that linear codes are also uniform. This
implies the following:

Lemma D.4. Let $\rho < \frac{1}{2}(1 - b/r)$. Then the code $C_{RS}$ is a uniform code that is $(\rho, \ell, \ell^r)$-list recoverable.
Further, it is recoverable in $O(\ell^r r^2 \log^2 N)$ time.$^{15}$

14 One could potentially use list decoding to recover from even more errors but that does not seem to buy much for our application.
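
The list recovery condition of Definition D.1 can be checked directly on a toy RS code. The sketch below swaps in a brute-force search over all $q^b$ messages in place of the Berlekamp Massey step, and assumes $q$ is prime so that arithmetic modulo $q$ gives the field; the function names are hypothetical.

```python
from itertools import product

def rs_encode(msg, points, q):
    """Evaluate the polynomial with coefficients msg (lowest degree first)
    at each evaluation point, modulo the prime q."""
    return tuple(
        sum(c * pow(x, s, q) for s, c in enumerate(msg)) % q for x in points
    )

def rs_list_recover_bruteforce(sets, points, q, b, rho):
    """Return all messages whose codeword lands in sets[i] for at least
    (1 - rho) * r of the r positions. Exhaustive search, for illustration only."""
    r = len(points)
    need = (1 - rho) * r
    recovered = []
    for msg in product(range(q), repeat=b):
        codeword = rs_encode(msg, points, q)
        agree = sum(1 for i in range(r) if codeword[i] in sets[i])
        if agree >= need:
            recovered.append(msg)
    return recovered
```

The search costs $q^b = N$ encodings, so it only serves to make the definition tangible; the running time in Lemma D.4 comes from decoding each candidate vector instead.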

E Constructive Proof of Loomis Whitney Inequality 

E.l Notations 

We begin with some notations. Let $\Sigma$ denote an arbitrary discrete "alphabet." We will consider subsets
$S \subseteq \Sigma^d$. For any subset $T \subseteq [d]$, we will use $S_T$ to denote the vectors in $S$ projected down to $T$. Further,
for any integer $0 \le i \le d$, we will use $\binom{[d]}{i}$ to denote the set of all subsets of $[d]$ of size exactly $i$.

Next, we define the "join" operators. Given any two subsets $T_1, T_2 \subseteq [d]$, two sets of vectors $V_1 \subseteq \Sigma^{T_1}$
and $V_2 \subseteq \Sigma^{T_2}$, and any subset $G \subseteq \Sigma^{T_1 \cap T_2}$, we define the join of $V_1$ and $V_2$ over $G$, denoted by $V_1 \bowtie_G V_2$,
to be the set of vectors $(u_1, a, u_2)$ in $\Sigma^{T_1 \cup T_2} = \Sigma^{T_1 \setminus T_2} \times \Sigma^{T_1 \cap T_2} \times \Sigma^{T_2 \setminus T_1}$ where $a \in G$, $(u_1, a) \in V_1$ and
$(a, u_2) \in V_2$. The join of $V_1$ and $V_2$ (without $G$ as an anchor), denoted by $V_1 \bowtie V_2$, is the set of vectors
$(u_1, a, u_2)$ in $\Sigma^{T_1 \cup T_2} = \Sigma^{T_1 \setminus T_2} \times \Sigma^{T_1 \cap T_2} \times \Sigma^{T_2 \setminus T_1}$ with $(u_1, a) \in V_1$ and $(a, u_2) \in V_2$. In particular, when
$T_1 \cap T_2 = \emptyset$, $V_1 \bowtie V_2$ is simply $V_1 \times V_2$ whose coordinates are indexed by $T_1 \cup T_2$.
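
The two join operators can be sketched as follows. The representation is an assumption made for illustration: a vector over an index set $T$ is a dict from coordinates to symbols, and an anchor set $G$ is a set of tuples of sorted (coordinate, symbol) pairs. The nested loop is the naive quadratic-time join, not an optimized one.

```python
def join(v1, v2, anchor=None):
    """Join two collections of partial vectors (dicts: coordinate -> symbol).

    Two vectors are merged when they agree on all shared coordinates and,
    if an anchor set is given, when their shared part belongs to it."""
    result = []
    for u in v1:
        for w in v2:
            shared = u.keys() & w.keys()
            if any(u[c] != w[c] for c in shared):
                continue  # the two vectors disagree on the overlap
            if anchor is not None:
                key = tuple(sorted((c, u[c]) for c in shared))
                if key not in anchor:
                    continue  # overlap is not one of the anchored values
            merged = {**u, **w}
            if merged not in result:
                result.append(merged)
    return result
```

When the two index sets are disjoint, no pair is ever rejected and the result is the Cartesian product, as noted above.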

E.2 Projections of size $d - 1$

In this subsection, we will consider the case when the projections are over $[d] \setminus \{i\}$ for every $i \in [d]$. We
will prove the following result:

Theorem E.1. Let $d \ge 1$ be an integer. For each $i \in [d]$, let $S_i \subseteq \Sigma^{[d] \setminus \{i\}}$ be given finite sets where
$|S_i| = k_i$ for some integer $k_i$. Let $S \subseteq \Sigma^d$ be the set of vectors such that $S_{[d] \setminus \{i\}} \subseteq S_i$ for every $i \in [d]$. Then

$$|S| \le \sqrt[d-1]{\prod_{i \in [d]} k_i}.$$

Furthermore, the "join" $S$ can be computed from the inputs $\{S_i\}_{i \in [d]}$ in time $O\left(d \cdot \sqrt[d-1]{\prod_{i \in [d]} k_i} + \sum_{i \in [d]} k_i\right)$.

E.2.1 Facts about certain labeled trees 

Let $\mathcal{T}$ be a binary tree. For every internal node $v \in \mathcal{T}$, let its left child be denoted by $v_L$ and its right child
be denoted by $v_R$. Further, the subtree rooted at any node $v \in \mathcal{T}$ will be denoted by $\mathcal{T}(v)$. Finally, let $\mathcal{L}(\mathcal{T})$
denote the set of leaves in $\mathcal{T}$.



To prove Theorem E.1 we will consider the following labeled trees. Given $d \ge 2$, consider any binary tree
$\mathcal{T}$ with $d$ leaves where each node $v \in \mathcal{T}$ is labeled with a subset $C(v) \subseteq [d]$. Without loss of generality,
we use the numbers $\{1, 2, \ldots, d\}$ to index the leaves $\mathcal{L}(\mathcal{T})$, i.e. each $\ell \in [d]$ is identified with a unique leaf
node of $\mathcal{T}$. The labeling is done as follows.

• Each leaf $\ell \in \mathcal{L}(\mathcal{T}) = [d]$ is labeled with the set $C(\ell) = [d] \setminus \{\ell\}$.

• Each internal node $v$ has label $C(v) = C(v_L) \cap C(v_R)$.

15 The $r^2 \log^2 N$ factor follows from the fact that the Berlekamp Massey algorithm needs $O(r^2)$ operations over $\mathbb{F}_q$, each of which
takes $O(\log^2 q)$ time.

We record the following simple properties of such labeled trees: 

Lemma E.2. Let d Js 2 be an integer and let T be a binary tree with d leaves labeled as above. Then the 
following are true: 

(i) For any internal node v, C(i>i,) U C(vr) = [d]; and 
(ii) For the root r ofT, C(r) = 0. 

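Proof. By induction, it is easy to see that for any node $v$,

$$C(v) = \bigcap_{\ell \in \mathcal{L}(T(v))} C(\ell). \qquad (25)$$

The above immediately implies (ii), as $\bigcap_{\ell \in [d]} ([d] \setminus \{\ell\}) = \emptyset$. For any leaf $\ell$, recall that $C(\ell) = [d] \setminus \{\ell\}$, which
along with (25) implies that $C(v) = [d] \setminus \mathcal{L}(T(v))$. This along with the fact that $\mathcal{L}(T(v_L)) \cap \mathcal{L}(T(v_R)) = \emptyset$
implies (i). □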
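
The labeling rule and both parts of Lemma E.2 can be checked mechanically on a small tree. The following is a minimal sketch, not part of the original proof; the nested-tuple tree encoding and the function name `label` are assumptions made only for this illustration.

```python
def label(node, d, checks):
    """Return (label, leaves) for the subtree rooted at `node`.

    Hypothetical encoding: a leaf is an int in 1..d and an internal
    node is a pair (left, right). Leaf l gets the label [d] minus {l};
    an internal node gets the intersection of its children's labels.
    """
    universe = frozenset(range(1, d + 1))
    if isinstance(node, int):
        return universe - {node}, frozenset({node})
    left_label, left_leaves = label(node[0], d, checks)
    right_label, right_leaves = label(node[1], d, checks)
    # Lemma E.2(i): the children's labels together cover [d].
    checks.append(left_label | right_label == universe)
    return left_label & right_label, left_leaves | right_leaves


d = 5
tree = ((1, (2, 3)), (4, 5))  # an arbitrary binary tree with leaves 1..5
checks = []
root_label, leaves = label(tree, d, checks)
```

Here `checks` records property (i) at each of the four internal nodes, and `root_label` is the label of the root, which property (ii) says must be empty.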

E.2.2 Proof of Theorem E.1

For notational convenience define

$$P = \sqrt[d-1]{\prod_{i \in [d]} k_i}.$$

We prove both parts by presenting an algorithm to compute $S$ from its potential projections $S_i$, $i \in [d]$. Let
$T$ be an arbitrary labeled binary tree with $d$ leaves as described in the last subsection. Every node $v$ will be
associated with its label $C(v)$ as well as two auxiliary sets $\Psi(v) \subseteq \Sigma^d$ and $\Pi(v) \subseteq \Sigma^{C(v)}$. The
set $\Psi(v)$ is supposed to contain the candidate vectors, some of which will be members of the final output $S$.
The set $\Pi(v)$ will be a superset of the projection $(S \setminus \Psi(v))_{C(v)}$.

For each leaf $\ell \in \mathcal{L}(T)$, define $\Psi(\ell) = \emptyset$ and $\Pi(\ell) = S_\ell$. We next describe how to algorithmically compute
the sets $\Psi(v)$ and $\Pi(v)$ for internal nodes $v$ recursively from the leaves up to the root.

Let $v$ be any internal node in $T$ whose left and right children's auxiliary sets have already been computed.
Without loss of generality, assume that $\Pi(v_L)_{C(v)} = \Pi(v_R)_{C(v)}$. (If not, we can simply remove
the elements from $\Pi(v_L)$ and $\Pi(v_R)$ whose projections onto $C(v)$ lie in the symmetric difference
$\Pi(v_L)_{C(v)} \,\Delta\, \Pi(v_R)_{C(v)}$.)

When $v$ is not the root, partition $\Pi(v_L)_{C(v)}$ into two sets $G$ and $B$ such that for every vector $u$ in the "good"
set $G$, the number of vectors in $\Pi(v_L)$ whose projection onto $C(v)$ is $u$ is upper bounded by $\lceil P / |\Pi(v_R)| \rceil - 1$.
The "bad" set $B$ is $\Pi(v_L)_{C(v)} \setminus G$. Finally, compute

$$\Psi(v) = (\Pi(v_R) \bowtie_G \Pi(v_L)) \cup \Psi(v_L) \cup \Psi(v_R),$$
$$\Pi(v) = B.$$
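
The good/bad partition step can be sketched as follows. This is only an illustration under stated assumptions: the threshold is taken to be $\lceil P / |\Pi(v_R)| \rceil - 1$ (as the later analysis of property 2 suggests), partial vectors are encoded as tuples of (coordinate, value) pairs, and the names `project` and `split_good_bad` are hypothetical.

```python
from collections import Counter
from math import ceil


def project(vec, coords):
    """Restrict a partial vector (tuple of (coordinate, value) pairs) to coords."""
    return tuple((c, v) for c, v in vec if c in coords)


def split_good_bad(pi_left, coords, P, size_right):
    """Split the projections of pi_left onto coords into (G, B).

    A projection u is 'good' if at most ceil(P / size_right) - 1 vectors
    of pi_left project onto it; all other projections are 'bad'.
    """
    threshold = ceil(P / size_right) - 1
    multiplicity = Counter(project(u, coords) for u in pi_left)
    good = {u for u, c in multiplicity.items() if c <= threshold}
    bad = set(multiplicity) - good
    return good, bad


# Three partial vectors over coordinates {2, 3}; the common label is {2}.
pi_left = {((2, 0), (3, 0)), ((2, 0), (3, 1)), ((2, 1), (3, 0))}
good, bad = split_good_bad(pi_left, coords={2}, P=4, size_right=3)
```

With these inputs the threshold is 1, so the projection with coordinate 2 equal to 0 (hit twice) lands in the bad set, and the averaging bound $|B| \le |\Pi(v_L)| / \lceil P/|\Pi(v_R)| \rceil$ evaluates to $3/2$.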

When $v$ is the root, we compute $\Psi(v) = (\Pi(v_R) \bowtie \Pi(v_L)) \cup \Psi(v_L) \cup \Psi(v_R)$ and $\Pi(v) = \emptyset$. By induction
on each step of the algorithm, we will show that the following three properties hold for every node $v \in T$:

1. $(S \setminus \Psi(v))_{C(v)} \subseteq \Pi(v)$;

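2. $|\Psi(v)| \le (|\mathcal{L}(T(v))| - 1) \cdot P$; and

3. $|\Pi(v)| \le \min\Big( \min_{\ell \in \mathcal{L}(T(v))} k_\ell,\ \prod_{\ell \in \mathcal{L}(T(v))} k_\ell \,/\, P^{|\mathcal{L}(T(v))| - 1} \Big)$.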

Assuming the above are true, we first complete the proof of the theorem. Let $r$ denote the root of the tree $T$.
By property 1, $(S \setminus \Psi(r_L))_{C(r_L)} \subseteq \Pi(r_L)$ and $(S \setminus \Psi(r_R))_{C(r_R)} \subseteq \Pi(r_R)$. Also recall that by Lemma E.2,
$C(r_L) \cup C(r_R) = [d]$ and $C(r_L) \cap C(r_R) = \emptyset$. Hence,

$$S \setminus (\Psi(r_L) \cup \Psi(r_R)) \subseteq \Pi(r_L) \times \Pi(r_R) = \Pi(r_L) \bowtie \Pi(r_R).$$

This implies $S \subseteq \Psi(r)$. Thus, from $\Psi(r)$ we can compute $S$ by keeping only vectors in $\Psi(r)$ whose
projection on any subset $L \in \binom{[d]}{d-1}$ is contained in $S_L$. In particular, $|S| \le |\Psi(r)| \le (d-1)P$, proving (24).



For the run time complexity of the above algorithm, we claim that for every node $v$, we need time $O(|\Psi(v)| +
|\Pi(v)|)$. To see this note that for each node $v$, we need to do the following:

(i) Make sure $\Pi(v_L)_{C(v)} = \Pi(v_R)_{C(v)}$,
(ii) Compute $G$ from $\Pi(v_L)$,
(iii) Compute $\Psi(v) = (\Pi(v_R) \bowtie_G \Pi(v_L)) \cup \Psi(v_L) \cup \Psi(v_R)$ and $\Pi(v) = B$.

It can be verified that all of these steps can be computed in time near-linear in the size of the largest set
involved (after sorting the sets, all the required computation can be done with a linear scan of the input lists),
which along with property 3 leads to a (loose) upper bound of $O(P + \min_{\ell \in \mathcal{L}(T(v))} k_\ell)$ on the run time for
node $v$. Summing the run time over all the nodes in the tree gives the claimed run time.
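
As a sanity check on what the algorithm is supposed to output, the join $S$ can also be computed by brute force on a tiny instance. The sketch below is deliberately not the tree algorithm described above: it simply enumerates candidates and filters them, and the function name `join_from_projections` is hypothetical. It then compares $|S|$ with the classical Loomis-Whitney quantity $\sqrt[d-1]{\prod_i k_i}$.

```python
from math import prod


def join_from_projections(proj_sets):
    """Brute-force join: all d-tuples v whose projection omitting
    coordinate i lies in proj_sets[i], for every i (0-based, d >= 2)."""
    d = len(proj_sets)
    # Tuples in proj_sets[1] omit coordinate 1, so they start with coordinate 0.
    firsts = {u[0] for u in proj_sets[1]}
    candidates = {(a,) + u for u in proj_sets[0] for a in firsts}
    return {v for v in candidates
            if all(v[:i] + v[i + 1:] in proj_sets[i] for i in range(d))}


# Projections of four points in {0,1}^3; each projection set has size 4.
points = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}
d = 3
proj_sets = [{p[:i] + p[i + 1:] for p in points} for i in range(d)]
S = join_from_projections(proj_sets)
P = prod(len(s) for s in proj_sets) ** (1 / (d - 1))
```

In this instance every projection set is all of $\{0,1\}^2$, so the join is the full cube with 8 vectors, which meets $\sqrt{4 \cdot 4 \cdot 4} = 8$ with equality.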

To complete the proof we argue that properties 1-3 hold. For the base case, consider $\ell \in \mathcal{L}(T)$. Recall that
in this case $\Psi(\ell) = \emptyset$ and $\Pi(\ell) = S_\ell$. It can be verified that for this case, properties 1-3 hold.

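Now assume that properties 1-3 hold for all children of an internal node $v$. We first verify properties 2-3 for
$v$. From the definition of $G$,

$$|\Pi(v_R) \bowtie_G \Pi(v_L)| \le \left( \left\lceil \frac{P}{|\Pi(v_R)|} \right\rceil - 1 \right) \cdot |\Pi(v_R)| < P.$$

From the inductive upper bounds on $\Psi(v_L)$ and $\Psi(v_R)$, property 2 holds at $v$. By definition of $G$ and an
averaging argument, note that

$$|B| = |\Pi(v)| \le \frac{|\Pi(v_L)|}{\lceil P / |\Pi(v_R)| \rceil} \le \frac{|\Pi(v_L)| \cdot |\Pi(v_R)|}{P}.$$

From the induction hypotheses on $v_L$ and $v_R$, we have $|\Pi(v_L)| \le \prod_{\ell \in \mathcal{L}(T(v_L))} k_\ell / P^{|\mathcal{L}(T(v_L))| - 1}$ and $|\Pi(v_R)| \le \prod_{\ell \in \mathcal{L}(T(v_R))} k_\ell / P^{|\mathcal{L}(T(v_R))| - 1}$,
which implies that $|\Pi(v)| \le \prod_{\ell \in \mathcal{L}(T(v))} k_\ell / P^{|\mathcal{L}(T(v))| - 1}$. Further, it is easy to see that $|\Pi(v)| \le \min(|\Pi(v_L)|, |\Pi(v_R)|)$,
which by induction implies that $|\Pi(v)| \le \min_{\ell \in \mathcal{L}(T(v))} k_\ell$. Property 3 is thus verified.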

Finally, we verify property 1. By induction, we have $(S \setminus \Psi(v_L))_{C(v_L)} \subseteq \Pi(v_L)$ and $(S \setminus \Psi(v_R))_{C(v_R)} \subseteq
\Pi(v_R)$. This along with the fact that $C(v_L) \cap C(v_R) = C(v)$ implies that $(S \setminus (\Psi(v_L) \cup \Psi(v_R)))_{C(v)} \subseteq
\Pi(v_L)_{C(v)} \cap \Pi(v_R)_{C(v)} = B \uplus G$. Further, every vector in $S \setminus (\Psi(v_L) \cup \Psi(v_R))$ whose projection onto
$C(v)$ is in $G$ also belongs to $\Pi(v_R) \bowtie_G \Pi(v_L)$. This implies that $(S \setminus \Psi(v))_{C(v)} \subseteq B = \Pi(v)$, as desired.

E.3 Error-Tolerant Constructive Loomis-Whitney Inequality

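In this section, we prove the following version of the Loomis-Whitney inequality that can handle certain
"errors," i.e. we are interested in the set $S$ such that a few projections need not lie in the given input
projection sets. In particular, we will prove the following:

Theorem E.3. Let $d \ge 2$ and $0 \le e \le d - 2$ be integers. Let $S_i \subseteq \Sigma^{[d] \setminus \{i\}}$ be sets of vectors such that for
every $i \in [d]$, $|S_i| = k_i$ for some positive integer $k_i$. Let $S \subseteq \Sigma^d$ be the largest set such that for every vector
$(v_1, \ldots, v_d) \in S$, there are at least $d - e$ values of $i \in [d]$ for which $(v_1, \ldots, v_{i-1}, v_{i+1}, \ldots, v_d) \in S_i$.
Then,

$$|S| \le \sum_{i=0}^{e} \sum_{B \in \binom{[d]}{i}} (d - i - 1) \Big( \prod_{j \in [d] \setminus B} k_j \Big)^{\frac{1}{d-i-1}}.$$

Further, $S$ can be computed from the projections $\{S_i\}_{i \in [d]}$ in time

$$O\left( \sum_{i=0}^{e} \sum_{B \in \binom{[d]}{i}} \left( (d - i - 1) \Big( \prod_{j \in [d] \setminus B} k_j \Big)^{\frac{1}{d-i-1}} + \sum_{j \in [d] \setminus B} k_j \right) \right).$$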

Proof. For each potential "error set" $B \subseteq [d]$ with the number of errors $|B| \le e$, we apply the algorithm
from Theorem E.1 to join all $S_\ell$, for $\ell \in [d] \setminus B$. The algorithm is identical to that of Theorem E.1 except
for two facts. First, the tree $T$ now only has $d - |B|$ leaves, each identified by a member $\ell \in [d] \setminus B$. Each
leaf $\ell$ has label $C(\ell) = [d] \setminus \{\ell\}$ as before. Second, the product $P$ is now defined to be

$$P = \Big( \prod_{j \in [d] \setminus B} k_j \Big)^{\frac{1}{d - |B| - 1}}.$$

Note also that the label $C(r)$ of the root $r$ of $T$ is no longer the empty set; however, this fact does not change
the analysis. □

F Omitted Material from Section 3

Note that $P = P^2 = P^T P$ because any orthogonal projection matrix is symmetric and idempotent. Hence,
for any two vectors $x, y \in \mathbb{R}^N$,

$$\langle Px, y \rangle = \langle x, Py \rangle = \langle x, P^T P y \rangle = \langle Px, Py \rangle.$$

In particular, $\langle P e_j, e_j \rangle = \|P e_j\|_2^2$ for any $j \in [N]$. The following was proved in [6]. We provide here a
very short proof.

Proposition F.1. Let $\Phi$ be an arbitrary real matrix of dimension $m \times N$. Then, there exists $j^* \in [N]$ such
that

$$\|P e_{j^*}\|_2^2 = \langle P e_{j^*}, e_{j^*} \rangle \ge 1 - m/N.$$
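
The statement is easy to check numerically in the simplest case $m = 1$. The sketch below assumes, as in the proof of Corollary F.2, that $P$ is the orthogonal projector onto the null space of $\Phi$; for a single nonzero row $a^T$ this is $P = I - a a^T / \|a\|_2^2$. The function name `projector_diagonal` and the sample vector are hypothetical.

```python
def projector_diagonal(a):
    """Diagonal of P = I - a a^T / ||a||^2, the orthogonal projector onto
    the null space of the 1 x N matrix Phi = a^T (a must be nonzero)."""
    norm_sq = sum(t * t for t in a)
    return [1.0 - t * t / norm_sq for t in a]


a = [3.0, -1.0, 2.0, 0.5, 4.0]  # the single measurement row (m = 1)
N, m = len(a), 1
diag = projector_diagonal(a)    # diag[j] = <P e_j, e_j>
best = max(diag)
```

The diagonal entries sum to $N - m$, so their maximum `best` is at least the average $1 - m/N$.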



Proof. Since $\langle P e_j, e_j \rangle$ is precisely the $j$th diagonal entry of the matrix $P$, we have $\mathrm{trace}(P) = \sum_{j=1}^{N} \langle P e_j, e_j \rangle$.
But the trace of an orthoprojector is the dimension of the target space, which is at least $N - m$ in this case (with equality when $\Phi$ has full row rank). Hence,
$N - m \le \sum_{j=1}^{N} \langle P e_j, e_j \rangle$, so the largest of the $N$ terms is at least $(N - m)/N = 1 - m/N$, which completes the proof. □

F.1 The forall case

Corollary F.2 (Cohen-Dahmen-DeVore [6]). Let $\Phi$ be an $m \times N$ $\ell_2/\ell_2$ "forall" sparse recovery measurement
matrix with $k \ge 1$, i.e. there exists a decoding algorithm $\mathcal{A}$ and a constant $C \ge 1$ such that for any
input signal $x \in \mathbb{R}^N$,

$$\|x - \mathcal{A}(\Phi x)\|_2 \le C \cdot \|x - x_k\|_2. \qquad (27)$$

Then it must be the case that $m \ge N/C^2$.






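Proof. Let $y = \mathcal{A}(\Phi 0) = \mathcal{A}(0)$. Then, by applying (27) with $x = 0$, it is easy to see that $\mathcal{A}(0) = \mathcal{A}(\Phi 0) = 0$.
Next, from Proposition F.1 there exists $j^* \in [N]$ such that $\langle P e_{j^*}, e_{j^*} \rangle \ge 1 - m/N$. Let $x = \frac{P e_{j^*}}{\|P e_{j^*}\|_2}$;
then $\|x\|_2 = 1$ and $\langle x, e_{j^*} \rangle^2 = \|P e_{j^*}\|_2^2$. Moreover, $\mathcal{A}(\Phi x) = \mathcal{A}(0) = 0$ because $P e_{j^*}$ is in the null space
of $\Phi$. Consequently, from (27) we obtain

$$1 = \|x\|_2^2 \le C^2 \cdot \|x - x_k\|_2^2 \le C^2 \cdot \sum_{j \ne j^*} x_j^2 = C^2 \cdot (1 - \langle x, e_{j^*} \rangle^2) = C^2 \cdot (1 - \|P e_{j^*}\|_2^2) \le C^2 \cdot m/N,$$

which is the desired result. □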

F.2 Proof of Theorem 3.4

To complete our lower bound proof, we need a "continuous" version of Yao's minimax principle.

Definition F.3. A sparse recovery system $S = (\Phi, \mathcal{A})$ is defined to be a pair consisting of a measurement
matrix $\Phi \in \mathbb{R}^{m \times N}$ and a mapping $\mathcal{A} : \mathbb{R}^m \to \mathbb{R}^N$. We assume that the mapping algorithm $\mathcal{A}$ has a finite
description length, is deterministic, and is the best mapping for the matrix $\Phi$.

We consider sparse recovery systems $(\Phi, \mathcal{A})$ in a compact set $Y$ (that is, our matrices $\Phi$ are in a compact
set in $\mathbb{R}^{m \times N}$). Let $\mathcal{R}$ be a probability measure on sparse recovery systems in $Y$, by which we mean $\mathcal{R}$
specifies a distribution on measurement matrices $\Phi$ (and the mapping $\mathcal{A}$ is the best possible deterministic,
finite description length mapping for that distribution on $\Phi$). Let $\mathcal{Y}$ be a compact set of convex combinations
of probability measures $\mathcal{R}$ on sparse recovery systems in $Y$.

Let us assume that our input vectors $x \in X$ and that $X$ is a compact subset of $\mathbb{R}^N$. Let $\mathcal{D}$ be a probability
measure on input vectors in $X$ and let $\mathcal{X}$ be a convex set of all such measures.

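Definition F.4. We say that a sparse recovery system $S = (\Phi, \mathcal{A})$ decodes $x$ correctly if

$$\|x - \mathcal{A}(\Phi x)\|_2 \le C \|x - x_k\|_2.$$

We define the cost of a sparse recovery system $S$ on input $x$ as

$$\mathrm{cost}(x, S) = \begin{cases} 1 & \text{if } \mathcal{A} \text{ does not decode } x \text{ correctly,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, the failure probability of a randomized sparse recovery system $\mathcal{R}$ on input $x$ is

$$\mathrm{cost}(x, \mathcal{R}) = \mathbb{E}_{S \sim \mathcal{R}} \big( \mathrm{cost}(x, S) \big).$$

Definition F.5. Let $\mathcal{D} \in \mathcal{X}$ and $\mathcal{R} \in \mathcal{Y}$ and define

$$f(\mathcal{D}, \mathcal{R}) = \mathrm{cost}(\mathcal{D}, \mathcal{R}) = \mathbb{E}_{x \sim \mathcal{D}} \Big( \mathbb{E}_{S \sim \mathcal{R}} \big( \mathrm{cost}(x, S) \big) \Big) = \mathbb{E}_{S \sim \mathcal{R}} \Big( \mathbb{E}_{x \sim \mathcal{D}} \big( \mathrm{cost}(x, S) \big) \Big).$$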



Observe that / is bi-linear and always finite (hence, it's a proper convex and concave function). Furthermore, 
we can change the order of expectation by Fubini's theorem. 

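Lemma F.6 (Continuous Yao's Lemma). With the compact, convex sets $\mathcal{X}$ and $\mathcal{Y}$ and the function $f :
\mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ defined above,

$$\max_{\mathcal{D} \in \mathcal{X}} \min_{S \in Y} \mathrm{cost}(\mathcal{D}, S) = \min_{\mathcal{R} \in \mathcal{Y}} \max_{x \in X} \mathrm{cost}(x, \mathcal{R}).$$

Proof. First, we observe that the hypotheses of the lemma match those of Sion's Minimax Theorem [26];
hence, we immediately have

$$\max_{\mathcal{D} \in \mathcal{X}} \min_{\mathcal{R} \in \mathcal{Y}} \mathrm{cost}(\mathcal{D}, \mathcal{R}) = \min_{\mathcal{R} \in \mathcal{Y}} \max_{\mathcal{D} \in \mathcal{X}} \mathrm{cost}(\mathcal{D}, \mathcal{R})$$

(in particular, we have used the fact that $\mathcal{X}$ and $\mathcal{Y}$ are compact to ensure that the suprema and infima are
attained).

To finish the proof, we argue that for all distributions on recovery systems $\mathcal{R}' \in \mathcal{Y}$,

$$\max_{\mathcal{D} \in \mathcal{X}} \mathrm{cost}(\mathcal{D}, \mathcal{R}') = \max_{\mathcal{D} \in \mathcal{X}} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{S \sim \mathcal{R}'} \big( \mathrm{cost}(x, S) \big) = \max_{x \in X} \mathbb{E}_{S \sim \mathcal{R}'} \big( \mathrm{cost}(x, S) \big)$$

and that for all distributions on inputs $\mathcal{D}' \in \mathcal{X}$,

$$\min_{\mathcal{R} \in \mathcal{Y}} \mathrm{cost}(\mathcal{D}', \mathcal{R}) = \min_{\mathcal{R} \in \mathcal{Y}} \mathbb{E}_{S \sim \mathcal{R}} \mathbb{E}_{x \sim \mathcal{D}'} \big( \mathrm{cost}(x, S) \big) = \min_{S \in Y} \mathbb{E}_{x \sim \mathcal{D}'} \big( \mathrm{cost}(x, S) \big)$$

from the convexity of $\mathcal{X}$ and $\mathcal{Y}$ and the compactness of $X$ and $Y$. □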

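So, if we find some distribution $\mathcal{D}$ on inputs for which the best sparse recovery system has failure probability
at least $p$ (i.e. high cost), then we have established a lower bound on the failure probability for randomized
recovery systems:

$$p \le \min_{\mathcal{R} \in \mathcal{Y}} \mathrm{cost}(\mathcal{D}, \mathcal{R})
\le \max_{\mathcal{D} \in \mathcal{X}} \min_{\mathcal{R} \in \mathcal{Y}} \mathrm{cost}(\mathcal{D}, \mathcal{R})
= \max_{\mathcal{D} \in \mathcal{X}} \min_{S \in Y} \mathrm{cost}(\mathcal{D}, S)
= \min_{\mathcal{R} \in \mathcal{Y}} \max_{\mathcal{D} \in \mathcal{X}} \mathrm{cost}(\mathcal{D}, \mathcal{R})
= \min_{\mathcal{R} \in \mathcal{Y}} \max_{x \in X} \mathrm{cost}(x, \mathcal{R}).$$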

In Lemma 3.3 we exhibited a hard distribution on input vectors $x$ for which the best sparse recovery system
has failure probability at least $\sqrt{1/\gamma} \cdot e^{-\frac{N}{2} \ln \frac{1}{2\gamma}}$, given that $\gamma$ and $\delta = m/N$ satisfy (3). It is not hard to
see that $\gamma = \delta = \frac{1}{12 + 16C^2}$ satisfy (3). Hence, any foreach sparse recovery system with failure probability at
most $p = \sqrt{12 + 16C^2} \cdot e^{-\frac{\ln(6 + 8C^2)}{2} N}$ must have at least $m \ge \delta N = \frac{N}{12 + 16C^2}$ measurements. In particular,
we have shown that for failure probability $2^{-\Theta(N)}$, the number of measurements is $\Omega(N)$.

We next give the simple reduction to handle the case of larger failure probability $p > \sqrt{12 + 16C^2} \cdot
e^{-\frac{\ln(6 + 8C^2)}{2} N}$. Define $N'$ such that $p = \sqrt{12 + 16C^2} \cdot e^{-\frac{\ln(6 + 8C^2)}{2} N'}$, i.e.

$$N' = \frac{2 \ln\big(\sqrt{12 + 16C^2}/p\big)}{\ln(6 + 8C^2)}.$$

In this case for the hard distribution, we zero out the last $N - N'$ entries in the input vectors and then apply
the hard distribution on the first $N'$ coordinates. This with the previous result implies that for failure probability
at most $p$, we need $m \ge \delta N' = \Omega(\log(1/p))$ measurements, as desired. We just proved Theorem 3.4.



G Omitted Material from Section 4

G.l Known Results 

The following result from Cohen, Dahmen, DeVore [5] establishes a tight upper bound on the number of
measurements with a polynomial time decoding algorithm for the foreach sparse recovery problem. The
algorithm $\mathcal{A}$ is Orthogonal Matching Pursuit (OMP), which runs in time $O(MNk)$.

Theorem G.1. There is a distribution on $M \times N$ matrices $\Phi$ such that for each $x \in \mathbb{R}^N$, the output of OMP
after $2k$ iterations satisfies

$$\|x - \mathcal{A}(\Phi x)\|_2 \le C \|x - x_k\|_2$$

with probability larger than $1 - p$ provided that $M \ge C(k \log(N/k) + \log(1/p))$.

Cohen et al. provide three examples of such distributions on matrices: (i) iid Gaussian random entries
$\Phi_{ij}$ with variance $1/M$, (ii) iid Bernoulli random entries $\Phi_{ij}$ with values $\pm 1/\sqrt{M}$, and (iii) columns of $\Phi$
drawn from a uniform distribution on $S^{M-1}$.

G.2 Proof of Lemma 4.10

Proof. We will use a simple repetition trick. $\mathcal{M}'$ is just $s$ copies of $\mathcal{M}$, where each copy gets fresh random
bits. The decoding algorithm is as follows: given the outputs $I_1, \ldots, I_s$ from the $s$ copies of $\mathcal{M}$, output an
$i \in \bigcup_{j=1}^{s} I_j$ if it appears in $> s/2$ of the $I_j$'s. Next we argue that the claimed bounds hold.

First note that since $|\bigcup_{j=1}^{s} I_j| \le s\ell$ and each $i$ that is output appears in $> s/2$ intermediate outputs $I_j$, we
can only output $< 2\ell$ such indices. Next note that by Lemma C.2, except with probability $p^{\Omega(s)}$, at most
$s/4$ outputs $I_j$ miss more than $\zeta k$ elements from $H_k(x)$. Call the remaining (at least $3s/4$) intermediate
outputs good. Note that we will not lose an $i$ if it appears in $> s/2$ good $I_j$'s. Then a simple counting
argument implies that we can have at most $3\zeta k$ elements that are missing from at least $s/4$ good $I_j$'s. □
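
The decoding rule in the repetition trick is a plain majority vote over the $s$ intermediate outputs. A minimal sketch follows; the function name `majority_decode` and the sample outputs are hypothetical and only illustrate the rule "keep $i$ if it appears in more than $s/2$ of the $I_j$'s".

```python
from collections import Counter


def majority_decode(outputs):
    """Return the indices that appear in more than half of the given sets."""
    s = len(outputs)
    counts = Counter(i for I in outputs for i in set(I))
    return {i for i, c in counts.items() if 2 * c > s}


# s = 4 intermediate outputs, each of size l = 3.
outputs = [{1, 2, 3}, {1, 2, 4}, {1, 5, 6}, {2, 1, 7}]
decoded = majority_decode(outputs)
```

Since every surviving index uses up more than $s/2$ of the at most $s\ell$ available slots, fewer than $2\ell$ indices can survive, matching the counting argument in the proof.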

H Proof of Theorem 4.7

Let $u = (u_1, \ldots, u_M) = \mathcal{M}x$. For any $i \in [N]$, define its estimate $\tilde{x}_i$ to be the median of the values
$\{\mathcal{M}_{i,j} \cdot u_j\}_{j \in \Gamma(i)}$. We will show that

Lemma H.1. The following holds with probability $1 - (N/k)^{-\Omega(\gamma k)}$. Except for $\gamma k$ positions $i \in [N]$, every other
index will have a good estimate: i.e.,

$$|x_i - \tilde{x}_i| \le \sqrt{\eta/k} \cdot \|z\|_2.$$

In this section, we prove Theorem 4.7 using Lemma H.1. The proof is almost identical to the similar one
in [22] (except $\hat{x}$ has a larger support size).

Let $\tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_N)$ and define $\hat{x}$ to be $\tilde{x}$ with all but the top $k' = k + k/\sqrt{\eta}$ entries (by their absolute
values) zeroed out. Further, let $T$ be the set of items in $[N]$ that do not have a good estimate (i.e. $|x_i - \tilde{x}_i| >
\sqrt{\eta/k} \cdot \|z\|_2$). Note that by Lemma H.1, $|T| \le \gamma k$.

To complete the proof of Theorem 4.7 we prove the following:

Lemma H.2. There exists a vector $y$ with $|\mathrm{supp}(y)| \le 2\gamma k$ such that for $z' = x - \hat{x} - y$, we have

$$\|z'\|_2^2 \le (1 + 22\sqrt{\eta}) \cdot \|z\|_2^2. \qquad (28)$$

Proof. We'll prove the above by case analysis and adding up the contribution of different indices to $\|z'\|_2^2$
(and also define the elements in $y$ in the process):

1. ($i \in \mathrm{supp}(\hat{x})$ with a good estimate) In this case, each such item contributes at most $\frac{\eta}{k} \cdot \|z\|_2^2$ to $\|z'\|_2^2$. Since
$|\mathrm{supp}(\hat{x})| = k(1 + 1/\sqrt{\eta})$, items in this case contribute at most $2\sqrt{\eta} \cdot \|z\|_2^2$. (Set $y_i = 0$.)

2. ($i \in \mathrm{supp}(\hat{x})$ with a bad estimate) These items do not contribute anything. (Set $y_i = x_i - \tilde{x}_i$.)

3. ($i \in \mathrm{supp}(z) \setminus \mathrm{supp}(\hat{x})$) These contribute at most $\|z\|_2^2$. (Set $y_i = 0$.)

4. ($i \in H_{k'}(x) \setminus \mathrm{supp}(\hat{x})$ with a good estimate is displaced by an item $i' \in \mathrm{supp}(\hat{x}) \setminus H_{k'}(x)$ with a good
estimate) In this case set $y_i = y_{i'} = 0$. Thus, in this case, we have $\hat{x}_i = 0$ and $z'_i = x_i$, $z'_{i'} = x_{i'} - \tilde{x}_{i'}$.
Now note that by definitions of $i$ and $i'$, we have

$$|x_{i'}| \ge |\tilde{x}_{i'}| - \sqrt{\eta/k}\,\|z\|_2 \ge |\tilde{x}_i| - \sqrt{\eta/k}\,\|z\|_2 \ge |x_i| - 2\sqrt{\eta/k}\,\|z\|_2.$$

Next, note that any item $i''$ such that $|x_{i''}| > \sqrt{\eta/k}\,\|z\|_2$ would satisfy $i'' \in H_{k'}(x)$. This in turn implies
that $|x_{i'}| \le \sqrt{\eta/k}\,\|z\|_2$. This along with the above inequalities implies that $|x_i| \le 3\sqrt{\eta/k}\,\|z\|_2$.
This in turn implies that

$$(z'_i)^2 + (z'_{i'})^2 \le \frac{9\eta}{k}\|z\|_2^2 + \frac{\eta}{k}\|z\|_2^2 = \frac{10\eta}{k}\|z\|_2^2.$$

Since there are at most $k + k/\sqrt{\eta}$ such pairs $(i, i')$, the total contribution from this case is at most
$20\sqrt{\eta}\,\|z\|_2^2$.

5. ($i \in H_{k'}(x) \setminus \mathrm{supp}(\hat{x})$ with a bad estimate or is displaced by an item with a bad estimate) These do
not contribute anything. (Set $y_i = x_i$.)

It can be verified that the following three things hold: (i) $x = \hat{x} + y + z'$. (ii) Since every item with a bad
estimate can contribute at most two non-zero items to $y$ (once in item 2 and once in item 5), we have that
$|\mathrm{supp}(y)| \le 2\gamma k$. (iii) Finally, by items 1, 3 and 4, (28) is satisfied. □



Some Remarks. We first note that it is possible to compute $\hat{x}$ efficiently in time near linear in $N$ (as it
involves computing $N$ median values and then outputting the top $k + k/\sqrt{\eta}$ values).

The argument in item 4 in the proof above also implies the following:

Corollary H.3. $\mathrm{supp}(\hat{x})$ contains all but $\gamma k$ items $i \in [N]$ that satisfy $|x_i| > 3\sqrt{\eta/k} \cdot \|z\|_2$.

In particular, consider the following algorithm.

0. The input is the vector $\mathcal{M}x$ and a subset $S \subseteq [N]$.

1. For each $j \in S$, compute the estimate $\tilde{x}_j$ as the median of the values $\{|(\mathcal{M}x)_b|\}_{b \in \Gamma(j)}$.

2. Output in $J$ the items $j \in S$ with the top $k + k/\eta$ estimates $\tilde{x}_j$.
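
Steps 0-2 above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the measurement vector is a plain list `u`, the neighborhoods $\Gamma(j)$ are given as a dictionary `buckets`, and the output size (which is $k + k/\eta$ in the text) is passed in as `count`. All of these names are hypothetical.

```python
from statistics import median


def top_estimates(u, buckets, S, count):
    """Median-of-buckets estimates for the items in S, then the top `count`.

    u       -- list of measurements, u[b] = (Mx)_b
    buckets -- dict mapping item j to the list of its buckets Gamma(j)
    S       -- list of candidate items
    count   -- number of items to output (k + k/eta in the text)
    """
    est = {j: median(abs(u[b]) for b in buckets[j]) for j in S}
    top = sorted(S, key=lambda j: est[j], reverse=True)[:count]
    return top, est


u = [5.0, 0.1, 5.2, 0.2, 4.9, 0.0]
buckets = {0: [0, 2, 4], 1: [1, 3, 5], 2: [0, 1, 3]}
J, est = top_estimates(u, buckets, S=[0, 1, 2], count=1)
```

Item 2 shares one bucket with the heavy item 0, but the median ignores that single outlier, which is the reason the estimate is taken as a median rather than a mean.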



The above along with Corollary H.3 (where we substitute $\eta$ by $\eta^2$) implies Theorem 4.8.

Finally, the proof of Theorem 4.7 also implies Theorem 4.9.

Proof Sketch of Theorem 4.9. The proof mimics the proof of Theorem 4.7 except for the following
two small changes. First, we adjust the constants so that Lemma H.1 implies that the median estimate
is correct for all but $\zeta k/2$ elements $i \in [N]$, i.e. $\tilde{x}_i$ is a good estimate of $x_i$. Second, in the argument
of the proof of Theorem 4.7 we also have to take into account the at most $\zeta k/2$ elements of $H_{k + k/\eta}(x)$ that
are missing from $S$. (Note that the algorithm to compute $\hat{x}$ is the one we used to prove Theorem 4.8: just
output the top $k + k/\eta$ median estimates for the elements in $S$.) □

H.1 Proof of Lemma H.1

We begin with some definitions and notation that will be useful in the proof. Let $\zeta > 0$ be a real that we will
fix later (to be $\Theta(\gamma)$) and let $\varepsilon = \zeta^3 \eta$.

We will call each element $j \in [M]$ a bucket. $\Gamma(i)$ for some $i \in [N]$ will be the set of $i$'s buckets. Finally,
we will call a bucket $j \in \Gamma(i)$ bad for index $i \in [N]$ if at least one of the following conditions holds:

• (Bad-1) $j \in \Gamma(h)$ for some heavy hitter $h \ne i$.

• (Bad-2) $j \in \Gamma(h)$ for some heavy tail element $h \ne i$.

• (Bad-3) The $\ell_1$ contribution (of the squared values $x_h^2$) of all light tail elements $h$ (other than $i$) to $j$ is $> \frac{\zeta\eta}{k} \cdot \|z\|_2^2$.

• (Bad-4) Define $x_j = \sum_{b \in [N] \setminus \{i\},\ j \in \Gamma(b)} \mathcal{M}_{b,j} \cdot x_b$. Then

$$x_j^2 > \frac{\eta}{k} \cdot \|z\|_2^2.$$

If a bucket satisfies (Bad-$b$) for some $b \in [4]$, then we will also call it a Bad-$b$ bucket for item $i$. If a bucket
is not Bad-$b$ for any $b \in [4]$, then we will call it good for $i$. If we do not specify for which item a bucket
is bad (or the more specific versions of bad as above), then it will be assumed to be bad for some $i \in [N]$.
For any $i \in [N]$ and $b \in [4]$, let $B_i^b \subseteq [M]$ denote the set of Bad-$b$ buckets for item $i$ and for notational
convenience define $B_i = B_i^1 \cup B_i^2 \cup B_i^3 \cup B_i^4$.

Note that if for any $i \in [N]$, we have $|\Gamma(i) \cap B_i| < \ell/2$, then $|x_i - \tilde{x}_i| \le \sqrt{\eta/k} \cdot \|z\|_2$ (this is because then
in the majority of the buckets, $x_i$ is the only potential heavy element and the tail noise is low), as desired.^16
To prove that at most $\gamma k$ items have bad estimates we will prove the following:



16 Actually we only need to consider buckets that are not Bad-1, not Bad-2 or not Bad-4, but not being Bad-3 makes the
analysis somewhat modular.

Lemma H.4. For any subset $S \subseteq [N] \setminus H_k(x)$ with $|S| = k$, the following are true:

$$\Big| \bigcup_{i \in H_k(x) \cup S} B_i^1 \Big| \le \frac{\gamma}{12} k\ell \quad \text{with probability 1.} \qquad (29)$$

$$\Big| \bigcup_{i \in H_k(x) \cup S} B_i^2 \Big| \le \frac{\gamma}{12} k\ell \quad \text{with probability 1.} \qquad (30)$$

$$\Big| \bigcup_{i \in H_k(x) \cup S} B_i^3 \cup B_i^4 \Big| \le \frac{\gamma}{12} k\ell \quad \text{with probability at least } 1 - \exp(-\Omega(\gamma k\ell)). \qquad (31)$$



Let $B = \bigcup_{i \in H_k(x) \cup S} B_i$. Then note that Lemma H.4 proves that except with probability $\exp(-\Omega(\gamma k\ell))$,
$|B| \le \frac{\gamma}{16} \cdot (4k)\ell$. This implies that by Lemma B.2 (with $a = 2$), there are at most $4(\gamma/16)(4k) = \gamma k$
elements $i \in H_k(x) \cup S$ such that they have at least $\ell/2$ bad buckets for item $i$. This shows that for every $S$,
the probability that there exists a subset $T \subseteq H_k(x) \cup S$ with $|T| = \gamma k$ such that every item in $T$ has a bad
estimate is upper bounded by $\exp(-\Omega(\gamma k\ell))$. Taking the union bound over all the $\binom{N}{\gamma k}$ choices for $T$, the
probability that there exists some set of $\gamma k$ items with a bad estimate is at most:

$$\binom{N}{\gamma k} \cdot \exp(-\Omega(\gamma k\ell)) \le \left( \frac{eN}{\gamma k} \right)^{\gamma k} \cdot \exp(-\Omega(\gamma k\ell)) \le \left( \frac{N}{k} \right)^{-\Omega(\gamma k)},$$

where the inequality follows from the assumption that $\ell \ge c \cdot \log(N/k)$ for some large enough constant
$c \ge \Omega(\log(1/\gamma))$. Thus, except with probability $(N/k)^{-\Omega(\gamma k)}$, other than at most $\gamma k$ items $i \in [N]$, every other
item has a good estimate, as desired.

In the rest of the section, we will prove Lemma H.4.

H.2 Proof of Lemma H.4



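Proof of (29). Fix any $i \in H_k(x) \cup S$ and consider a bucket $j \in \Gamma(i)$. If $j$ is Bad-1 for $i$ then it means
that $j$ is not a "unique neighbor" of $H_k(x) \cup S$. By Lemma B.3 we then get that

$$\Big| \bigcup_{i \in H_k(x) \cup S} B_i^1 \Big| \le 2\varepsilon(2k)\ell = 4\zeta^3 \eta k\ell \le \frac{\gamma}{16} k\ell < \frac{\gamma}{12} k\ell,$$

where the second inequality follows if we choose $\zeta \le \gamma/4$ (as $\gamma^3 \le \gamma$ and $\eta \le 1$).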



Proof of (30). We will make separate arguments for heavy tail elements that are in $S$ and those that are
not. To be more precise, let $H_{in} \subseteq S$ be the heavy tail elements in $S$ and $H_{out} \subseteq [N] \setminus H_{in}$ be the heavy tail
items outside of $S$. We'll make slightly different arguments depending on whether $|H_{out}| < 2k$ or not:

• Case 1 ($|H_{out}| < 2k$) In this case we will prove the following stronger inequality:

$$\Big| \bigcup_{i \in H_k(x) \cup S \cup H_{out}} B_i^2 \Big| \le \frac{\gamma}{12} k\ell.$$

Fix an $i \in H_k(x) \cup S \cup H_{out}$ and consider a bucket $j \in \Gamma(i)$ that is Bad-2 for $i$. Then as in the previous
case, we get a non-unique neighbor in $\Gamma(H_k(x) \cup S \cup H_{out})$. Note that $|H_k(x) \cup S \cup H_{out}| < 4k$ and
thus by Lemma B.3 we have

$$\Big| \bigcup_{i \in H_k(x) \cup S \cup H_{out}} B_i^2 \Big| \le 2\varepsilon(4k)\ell = 8\zeta^3 \eta k\ell \le \frac{\gamma}{27} k\ell < \frac{\gamma}{12} k\ell,$$

where the second inequality follows if we choose $\zeta \le \gamma/6$.

• Case 2 ($|H_{out}| \ge 2k$) In this case we will handle collisions with heavy tail elements from $H_{out}$
separately. Fix an element $i \in H_k(x) \cup S$ and consider a bucket $j \in \Gamma(i)$. If $j$ is also in $\Gamma(h)$ for
some heavy tail element $h \in H_{in}$, then call such a bucket a light-Bad-2 bucket for $i$. Otherwise, if
$i$ collides with some heavy tail element $h \in H_{out}$ in bucket $j$, then we call $j$ a heavy-Bad-2
bucket for $i$. By the same argument as in the proof of (29), we can upper bound the number of
light-Bad-2 buckets for any $i$ by

$$2\varepsilon(2k)\ell = 4\zeta^3 \eta k\ell \le \frac{4\gamma}{125} k\ell < \frac{\gamma}{24} k\ell, \qquad (32)$$

where the first inequality follows if we choose $\zeta \le \gamma/5$.

To bound the number of heavy-Bad-2 buckets we note that this would be upper bounded by $|\Gamma(H_k(x) \cup
S) \cap \Gamma(H_{out})|$. By Lemma B.1 this is upper bounded by

$$4\varepsilon |H_{out}| \ell \le 4\varepsilon \cdot \frac{k}{\zeta^2 \eta} \cdot \ell = 4\zeta k\ell \le \frac{\gamma}{24} k\ell, \qquad (33)$$

where the first inequality follows by noting that the total number of heavy tail items (which upper
bounds $|H_{out}|$) is at most $k/(\zeta^2 \eta)$ and the last inequality follows if $\zeta \le \gamma/96$.

The upper bounds in (32) and (33) prove (30).



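Proof of (31). We will prove (31) by proving the following two inequalities. We will show that the following
always holds:

$$\Big| \bigcup_{i \in H_k(x) \cup S} B_i^3 \Big| \le \frac{\gamma}{24} k\ell, \qquad (34)$$

and the following fails to hold with probability at most $\exp(-\Omega(\gamma k\ell))$:

$$\Big| \bigcup_{i \in H_k(x) \cup S} B_i^4 \setminus B_i^3 \Big| \le \frac{\gamma}{24} k\ell. \qquad (35)$$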



Proof of (34). We will prove (34) using an argument very similar to that in [22]. (The only difference is
that the argument for light tail with small support is less involved because of the use of expanders.) We will
first compute the sum of the $\ell_1$ contribution of the light tail elements to $\Gamma(H_k(x) \cup S)$. The main idea is
to decompose this sum into a (sub)-convex combination of flat tail contributions and a contribution from a tail
with small support.

First, we consider the contribution of the light tail elements from $S$ itself that contribute to the potential
Bad-3 buckets. We take care of this contribution with an argument similar to the one we used to prove (29).
In particular, if bucket $j$ is a Bad-3 bucket for $i \in H_k(x) \cup S$ due to a light tail element from $S$ then bucket
$j$ is not a unique neighbor. Thus, the number of such Bad-3 buckets is upper bounded (due to Lemma B.3)
by at most $4\zeta^3 \eta k\ell \le \frac{\gamma}{54} k\ell < \frac{\gamma}{48} k\ell$ if we choose $\zeta \le \gamma/6$.

Next, let $L \subseteq [N] \setminus S$ be the set of light tail elements with non-zero value outside of $S$. We will first bound
the sum of the $\ell_1$ contribution of light tail elements from $L$ to $\Gamma(H_k(x) \cup S)$; we denote the sum by $\Sigma_L$.
Define $w = (x_i^2)_{i \in L}$. (Note that $\|w\|_1 \le \|z\|_2^2$.)

We first assume that $|L| \ge 2k$. Let $m$ be the smallest $w_i$ value among all $i \in L$ and consider the vector
$w_m$ where $\mathrm{supp}(w_m) = L$ and each non-zero value is $m$. (Note that $w_m$ is a flat tail scaled by $m|L|$.)
By Lemma B.1, the contribution of $w_m$ to $\Sigma_L$ is upper bounded by $4\varepsilon\ell m|L| = 4\varepsilon\ell \|w_m\|_1$. Update $w \leftarrow
w - w_m$ and $L \leftarrow L \setminus \{i \in L \mid w_i = m\}$ (with $w_i$ taken before the update). If $|L| \ge 2k$ we repeat the process above. We note that the total
contribution while $|L| \ge 2k$ in the process above is at most $4\varepsilon\ell \|w\|_1 \le 4\varepsilon\ell \|z\|_2^2$. Finally, when we are left
with $|L| < 2k$, since each element in the residual $w$ can contribute at most $\frac{\zeta^2 \eta}{k} \cdot \|z\|_2^2$ (repeated at most
$\ell$ times) to $\Sigma_L$, we conclude that the final contribution to $\Sigma_L$ is at most $2\zeta^2 \eta \ell \|z\|_2^2$. This implies that

$$\Sigma_L \le (4\varepsilon + 2\zeta^2 \eta) \ell \|z\|_2^2 \le 6\zeta^2 \eta \ell \|z\|_2^2.$$

In the worst case the sum $\Sigma_L$ contributes to new Bad-3 buckets. In particular, by the Markov argument
(and the bound on $\Sigma_L$ above along with the fact that there are at most $2k\ell$ buckets), the number of new
Bad-3 buckets is upper bounded by $6\zeta k\ell \le \frac{\gamma}{48} k\ell$, if we pick $\zeta \le \gamma/288$.
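
The peeling process used above (repeatedly subtract the current minimum from every remaining coordinate, recording a flat layer each time, and stop once the support drops below a threshold) can be sketched as follows. This is only an illustration of the decomposition itself: the name `peel_flat_tails` and the parameter `min_support` (playing the role of $2k$) are hypothetical, and the expander-based bounds on each layer are not modeled.

```python
def peel_flat_tails(w, min_support):
    """Decompose a positive vector into flat layers plus a small residual.

    w           -- dict mapping index -> positive value (w_i = x_i^2 in the text)
    min_support -- stop peeling once fewer than this many indices remain

    Returns (layers, residual), where each layer is (height m, support set)
    and represents the flat vector equal to m on its support.
    """
    w = dict(w)
    layers = []
    while w and len(w) >= min_support:
        m = min(w.values())
        layers.append((m, frozenset(w)))
        # Subtract the flat layer; indices that reach zero leave the support.
        w = {i: v - m for i, v in w.items() if v > m}
    return layers, w


w = {0: 1, 1: 1, 2: 3, 3: 4, 4: 4}
layers, residual = peel_flat_tails(w, min_support=3)
layer_mass = sum(m * len(support) for m, support in layers)
```

The $\ell_1$ mass of the layers plus that of the residual equals $\|w\|_1$, which is why the flat-tail contributions can be summed against $\|w\|_1 \le \|z\|_2^2$ in the proof.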



Adding the contributions of light tail elements from $S$ and outside of $S$ proves (34).



Proof of (35). Consider any bucket $j$ that is not a Bad-3 bucket. This implies that the $\ell_2$ contribution from
the light tail elements is at most $\sqrt{\zeta\eta/k} \cdot \|z\|_2$. Then by Lemma C.1 we have that with probability at most
$\zeta$, bucket $j$ is Bad-4 (where $x_j$ is as defined earlier). This implies that the expected number of
Bad-4 buckets is at most $\zeta k\ell$. Further, since every non-zero $\mathcal{M}_{i,j}$ is an independent $\pm 1$ value, we can apply
the Chernoff bound (Lemma C.2) to bound the probability of more than $2\zeta k\ell$ Bad-4 buckets by $\exp(-\Omega(\zeta k\ell))$.
Thus, except with probability $\exp(-\Omega(\zeta k\ell))$ the number of Bad-4 buckets is upper bounded by

$$2\zeta k\ell \le \frac{\gamma}{24} k\ell,$$

if we pick $\zeta \le \gamma/48$.

Wrapping up. Looking at all the conditions on $\zeta$, we note that the choice $\zeta = \frac{\gamma}{288}$ satisfies all the
required conditions. Note that this also implies that the probability of not satisfying (35) is upper bounded by
$\exp(-\Omega(\gamma k\ell))$, as desired.

H.3 Some observation on the use of randomness 



We start off by observing that the only place that uses randomness in the proof of Theorem 4.7 is in
Lemma H.4. Further, the only place in the proof of Lemma H.4 that uses randomness is in the proof of
(35).

First note that by the dependence of the probability bound in Lemma C.2, we can argue the same probability
dependence as in (35) even if the random $\pm 1$ values were $O(k\ell)$-wise independent. In other words,

Remark H.5. Theorems 4.7, 4.8 and 4.9 hold even if the random $\pm 1$ entries are $O(k\ell)$-wise independent.



Next we note that if we were proving the corresponding $\ell_1/\ell_1$ result to Theorem 4.7 then we would not
need any randomness. In particular, for the $\ell_1/\ell_1$ case, if we define the Bad-3 event to be such that the total
$\ell_1$ mass of all light tail elements other than $i$ is $> \frac{\zeta\eta}{k} \|z\|_1$, then if a bucket is not Bad-3 then we do not
need to consider the Bad-4 event. In other words,



Remark H.6. The versions of Theorems 4.7 4.8 and 4.9 for £\j£\ sparse recovery holds even without 



multiplying the matrix M.q with random ±1 values. In other words, the results hold deterministically. 



I Proof of Lemma 4.13

The algorithm is the same as in the proof of Theorem 4.8.

The main idea in proving the correctness of the algorithm is to argue that under the map $f$, the heavy hitters
do not suffer many collisions or "acquire" heavy $\ell_2$ noise. Then we apply Corollary H.3 (or more precisely
the proof of Lemma H.1) on the vector obtained by "applying" $f$ to $x$. In what follows, we will be using
notation from the previous section.

We first consider the effect of $f$ on $x$. Call a heavy hitter $h \in H_k(x)$ corrupted if either $f(h) = f(h')$
for some heavy hitter/heavy tail element $h'$ or the $\ell_2^2$ sum contribution of light tail elements to $f(h)$ is
$> \frac{\zeta^2 \eta}{k} \cdot \|z\|_2^2$. We next argue that

Lemma I.1. Except with probability $(M/k)^{-\Omega(\alpha\zeta k)}$, at most $5\zeta k$ heavy hitters are corrupted.



Proof. The argument is very similar to that in [22].

To begin with, we upper bound the number of corruptions due to a collision. Just for this proof, we will refer
to an element $i \in [N]$ as a heavy tail if $x_i^2 > \frac{\zeta^5 \eta^2 \|z\|_2^2}{k}$.

Note that even conditioned on the maps of the heavy tail items (of which there are at most $\zeta^{-5} \eta^{-2} k$)
and the other heavy hitters (of which there are at most $k - 1$), a given heavy hitter suffers a collision with
probability at most $O(k(\zeta^{-5} \eta^{-2} + 1)/M^{1-\alpha})$. Then using an argument similar to the one used in [22], we can apply
Lemma C.2 to bound the probability of having more than $\zeta k$ corruptions due to collisions by



ek 2 (i + rv 2 ) a nm < (m\ Q( ' a(k) 



M 1 ~ a Ck J V k J 

where the inequality follows from the fact that AL & ° 2 ^ (M/k) a (which in turn follows from the lower 
bound M > ft(C~V 2 ■ k^ ■ (log(N/k)) 2 )). 

Next we upper bound the number of corruptions due to large £ 2 noise from light tail items. Define w G 1^ 
be to zero on the locations of heavy hitters and heavy tail items. For a light tail element i, define Wi be x 2 
rounded up to the next highest number in {||z||2/2*}j^o- Note that all i such that \w{\ ^ C, 5 rj 2 \\'L\\ 2 ,/N can 
contribute at most £ 5 ?7 2 ||z||2. 

Thus, ignoring such elements, since the largest value of a light tail element is C, 5 rj 2 \'L\\ 2 ,/k, we are left with 
n' = log [C r] 2 N/k) = 0(log(N/k)) distinct values in w. In particular, we can decompose w into n' 
(scaled) flat tails: call these tails wo, ... , w n /_i. (W.l.o.g. assume that Hwj+iHoo ^ || w« ||oo/2.) Call a tail 
Wj small if |supp(w.j)| ^ k/(( 2 rj). Otherwise, call a tail Wj large. Note that since ||wo||oo ^ C 5? ? 2 || z ll2/^' 
the total contribution of all small tails is at most 

n'-i , , 

V -2- • Hwolloo • 2 _i < 2—- • Hwolloo < 2C 3 r?||z||2. 

Now consider a large tail Wj. Each item in supp(wj) collides with a heavy hitter (independently) with prob- 
ability at most 0{k/M l ~ a ) (even conditioned on the map of a heavy hitter under /). Thus, by Lemma C.2 
except with probability 

efc|supp(w t )| \ "(^IsuMwOI) < ( ek \ nm 
C 3 r/M- lQ |supp(wi)|y ^ \( 3 r]M 1 - a ) 

Wj contributes at most ( 3 r/\\ wjI^ I 2 , noise to the heavy hitters under /. Thus the total noise contribution of 

all the large tails is at most £ 3 ??||z||2 except with probability n! ■ ( S-L j ^ (M/k)~ a ( at > k \ where 

the last inequality follows from the lower bound on $M$.

Thus, the total $\ell_2^2$ contribution of all the light tail elements to the heavy hitters under the map $f$ is at most
$4\zeta^3 \eta \|z\|_2^2$. Thus, by the Markov inequality at most $4\zeta k$ heavy hitters get corrupted by a light tail element.

Thus, except with probability $(M/k)^{-\Omega(\alpha\zeta k)}$, at most $5\zeta k$ heavy hitters get corrupted, as desired. □



The rest of the proof is very similar to that of Theorem 4.7, so we will just sketch the differences below.
The main reason we can do this is that, other than the analysis of the Bad-4 buckets, we are essentially
working with the $\ell_2^2$ tail. (Also, the analysis of the Bad-4 buckets only depends on whether the $\ell_1$ noise in a
bucket is large or not.)

To be more precise, define a vector related to $x$ where all but the heavy hitters are squared: i.e. $y =
(y_1, \dots, y_N)$, where $y_i = x_i$ if $i \in H_k(x)$, otherwise $y_i = x_i^2$. Now define the vector $f(y)$ to be the natural
"partial sum" vector of $y$ under $f$: i.e., $w = (w_1, \dots, w_M) = f(y)$ such that $w_j = \sum_{i : f(i) = j} y_i$. For the
proof, call an item $j \in [M]$ a heavy hitter if it contains an uncorrupted heavy hitter from $x$ mapped under
$f$ to it. Note that by Lemma 4.13 we can assume that there are at least $k(1 - 4\zeta)$ heavy hitters in $w$. As
before, define an item $j \in [M]$ to be a heavy tail item if $|w_j| > \zeta^2\eta\|z\|_2^2/k$. Note that we can have at most
$\zeta^{-2}\eta^{-1}k + 4\zeta k$ heavy tail items (where the last term can arise from the corrupted heavy hitters from $x$). Now the rest of
the argument remains unchanged (by adjusting the constants), except for the following simple change: in
a bucket that is not Bad-1, Bad-2 or Bad-3, we still have to account for the $\ell_2$ noise that an uncorrupted
heavy hitter obtains due to the mapping $f$. However, this only increases the $\ell_2$ noise by a constant factor
over what was handled earlier. Again by adjusting constants one can arrange for the claim in Lemma 4.13 to hold.
(One also has to add up the failure probabilities from Lemma I.1 and the one that one gets from the expander part of
$\mathcal{M}$, but both are of the same order.)
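The construction of $y$ and of the partial sum vector $w = f(y)$ can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: indices are 0-based, the map $f$ is passed as a plain list `f` with `f[i]` in `range(M)`, and the names `partial_sum_vector`, `heavy` and `f` are hypothetical and do not come from the paper.

```python
def partial_sum_vector(x, heavy, f, M):
    """Return w = f(y) for y_i = x_i on heavy hitters and y_i = x_i**2 elsewhere.

    x     : list of input values (0-based indices; an assumption of this sketch)
    heavy : set of indices playing the role of H_k(x)
    f     : list with f[i] in range(M), standing in for the map f
    M     : number of buckets
    """
    w = [0.0] * M
    for i, xi in enumerate(x):
        # Heavy hitters are kept as is; every other entry is squared.
        yi = xi if i in heavy else xi * xi
        # Bucket j collects the sum of y_i over all i with f(i) = j.
        w[f[i]] += yi
    return w


example_w = partial_sum_vector([3.0, 1.0, 2.0, 0.5], {0}, [0, 1, 1, 0], 2)
```

In this toy call, bucket 0 receives the heavy value 3.0 plus the squared tail value 0.25, and bucket 1 receives the two squared tail values 1.0 and 4.0.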



I.1 Some observations on the use of randomness



In this section, we make some observations on the use of randomness in the proof of Lemma 4.13 which were
not already covered in Section H.3.



The new place where one uses randomness in the proof of Lemma 4.13 is in Lemma I.1. In particular, we
used the randomness of the map $f$ in proving bounds (36) and (37). We note that both of these bounds would
hold even if the map $f$ were $O(k)$-wise independent. Along with Remark H.5 this implies the following.



Remark I.2. Lemma 4.13 holds even if the random $\pm 1$ entries are $O(k\zeta)$-wise independent and the map $f$
is only $O(k)$-wise independent.



Next we note that if we were proving the corresponding $\ell_1/\ell_1$ result to Lemma 4.13 then by Remark H.6
we do not need the random $\pm 1$ entries. There is another source of randomness: the map $f$. We will not
get rid of the randomness used in the map $f$ (at least not quite yet). However, for future use it would be
beneficial to note, if one were to try to get rid of the randomness in the proof of Lemma I.1, how many events
one would have to take a union bound against.



Let us first consider (36). In this case we assume that the locations of the at most $k' = \zeta^{-5}\eta^{-2}k + k$ "heavy
hitters" are fixed. Thus if we can do a union bound over all $\binom{N}{k'}$ choices of these items, then we would be able
to prove that there are no more than $\zeta k$ collisions probabilistically. Now let us consider (37). We first note
that we only used randomness in trying to bound the corruption due to noise from light tail elements that
come from $O(\log(N/k))$ large tails. In particular, we note that the argument assumes that the following are
fixed: (i) the locations of the at most $k$ heavy hitters and (ii) the assignment of the light tail elements to one of
the $O(\log(N/k))$ large tails (or whether an element is not part of any large tail). The key observation is that we only care
about the locations of elements in (i) and (ii) but not the actual values at those locations. Note that then the
number of choices is $\binom{N}{k} + N^{O(\log(N/k))}$.

Remark I.3. The version of Lemma 4.13 for $\ell_1/\ell_1$ sparse recovery holds even without multiplying the
matrix $\mathcal{M}$ with random $\pm 1$ values. Further, one can also get rid of the randomness in the map $f$ if we are
willing to take a union bound against $\binom{N}{k'} + N^{x}$ events, where $k' = O(\zeta^{-5}\eta^{-2}k)$ and $x = O(\log(N/k))$.
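To get a feel for the size of this union bound, the following Python sketch evaluates the count of events exactly for small parameters. It is only an illustration: the function names are hypothetical, and the exponent `x` is passed explicitly because the constant hidden in the $O(\log(N/k))$ term is not specified here.

```python
import math


def union_bound_events(N, k_prime, x):
    """Exact value of binom(N, k') + N**x, the number of events in the union bound."""
    return math.comb(N, k_prime) + N ** x


def log2_union_bound_events(N, k_prime, x):
    """Base-2 logarithm of the event count, i.e. the number of bits it occupies."""
    return math.log2(union_bound_events(N, k_prime, x))


small_count = union_bound_events(8, 2, 3)
```

For the toy values above, the two terms are $\binom{8}{2} = 28$ and $8^3 = 512$, so the second term already dominates; a failure probability would have to be smaller than the reciprocal of this count for the union bound to be useful.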



J Inverting a Random function 

We pick $f : [N] \to [N^2]$ randomly by picking a random degree $N - 1$ polynomial over $\mathbb{F}_{N^2}$ and define $f(i)$
to be the evaluation of this polynomial at $i$. It is well-known that such a random hash function is $N$-wise
independent (see e.g. ifTTTl ) or fully independent. 
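A minimal Python sketch of such a polynomial hash is given below. It makes one loud simplifying assumption that is not in the text: instead of the field $\mathbb{F}_{N^2}$ it works over the prime field $\mathbb{F}_p$ for the smallest prime $p \ge N^2$, so the range is `range(p)` rather than exactly $[N^2]$. The names `random_poly_hash`, `is_prime` and `next_prime` are hypothetical helpers introduced only for illustration.

```python
import random


def is_prime(n):
    """Trial-division primality test; adequate for the small moduli used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime that is at least n."""
    while not is_prime(n):
        n += 1
    return n


def random_poly_hash(N, rng):
    """Return (f, p): f evaluates a random polynomial of degree at most N - 1 over F_p.

    Assumption of this sketch: p is the smallest prime >= N*N, used in place
    of the field with N^2 elements mentioned in the text.
    """
    p = next_prime(N * N)
    coeffs = [rng.randrange(p) for _ in range(N)]  # N uniformly random coefficients

    def f(i):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule, highest-degree coefficient first
            acc = (acc * i + c) % p
        return acc

    return f, p


f, p = random_poly_hash(8, random.Random(0))
values = [f(i) for i in range(8)]
```

With $N$ uniformly random coefficients, the values at any $N$ distinct points of $\mathbb{F}_p$ are jointly uniform, which is the $N$-wise independence used above.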

We have three issues to tackle. 

First, we now have to first compute the matrix $\mathcal{M}'$ with $N^2$ columns (and then pick the columns of $\mathcal{M}'$
indexed by $f(i) \in [N^2]$ for $i \in [N]$) instead of the $N$ columns we have used so far. This, however, is not an
issue since we have a family of matrices $\mathcal{M}_n$. This change results in some constants in the parameters of
$\mathcal{M}^*$ changing, but it does not affect the asymptotics.



Second, we have to show that this new $f$ is fine with respect to applying Lemma 4.13. In particular, we
have to show the following: consider any node $v$ in the recursion tree. Then we will show that the (new)
$\phi_v : [N] \to [M]$ is $(d, \alpha)$-random (for some small $\alpha$). We will in fact show that these functions are $(d, 0)$-random.
Indeed, since $f$ is $N$-wise independent with $N > d + 1$ and the codes $\{C_n\}$ are uniform, the new map
$\phi_v$ is also $(d + 1)$-wise independent. (This can be proven e.g. by induction on the level of $v$.) This in turn
implies that $\phi_v$ is $(d, 0)$-random, which is sufficient for Lemma 4.13.

Finally, we come to the inversion. Given $j \in [N^2]$, we would like to compute "$f^{-1}(j)$." (Recall we do this
step only once at the root of the recursion tree after the identification algorithm above has terminated with
the output $I_w \subseteq [N^2]$. Recall that $|I_w| \le O(k/\eta)$.) Unfortunately, we still cannot guarantee that the inverse
is unique. So we do the following: if we have $j \in I_w$ such that $|f^{-1}(j)| > 1$, we just drop this index;
otherwise we output $f^{-1}(j)$. We first argue that this step does not drop too many indices and then consider
how quickly we can solve this step.

Recall that we only care about whether we identify elements of $H_{k'}(x)$, where $k' = O(\zeta^{-2}\eta^{-1}k)$. Note that
since $f$ is completely independent, for every $i \in H_{k'}(x)$, even conditioned on the value of $f(j)$ for every
other $j \in [N] \setminus \{i\}$, $f(i)$ is completely independent. Thus, the probability that $f(i) = f(j)$ for one of these
$j$ is at most $N/N^2 = 1/N$. Thus, by an argument similar to the one in E2l (to take care of the dependencies),
we can use Lemma C.2 to argue that the probability of more than $\zeta k$ $i$'s colliding with another $j$ is at
most $(N/k)^{-\Omega(\zeta k)}$. Thus, the algorithm above loses an extra $\zeta k$ indices while maintaining the same failure
probability. This extra $\zeta k$ additive factor only changes the constants and thus can be absorbed into the
analysis without much trouble.

We now come to the part about computing $f^{-1}(j)$. To do this step, we just store the pairs $(f(i), i)_{i \in [N]}$ in an
array (sorted by the first entry) of size $O(N \log N)$ bits. Note that given this array (by binary search), we can
in time $O(\log N)$, given a $j \in [N^2]$, figure out $f^{-1}(j)$ (if it is unique). Since we have to do this inversion
$O(k/\eta)$ times, we add an additional factor of $O((k/\eta) \log N)$ to the identification time. (This additive factor
will never be asymptotically significant in our final results.)
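The lookup just described can be sketched as follows in Python. This is a hedged illustration rather than the paper's implementation: the names `build_inverse_table` and `invert` are hypothetical, indices are 0-based, and the map $f$ is passed as a list of its values. Indices with more than one preimage are dropped by returning `None`, mirroring the rule stated earlier.

```python
from bisect import bisect_left


def build_inverse_table(f_values):
    """Return the pairs (f(i), i) sorted by first entry; f_values[i] plays the role of f(i)."""
    return sorted((fi, i) for i, fi in enumerate(f_values))


def invert(table, j):
    """Return the unique i with f(i) = j, or None if j has zero or several preimages."""
    # (j, -1) sorts before every real pair (j, i), so pos is the first pair with key >= j.
    pos = bisect_left(table, (j, -1))
    if pos == len(table) or table[pos][0] != j:
        return None  # j is not in the image of f
    if pos + 1 < len(table) and table[pos + 1][0] == j:
        return None  # more than one preimage: drop this index
    return table[pos][1]


table = build_inverse_table([5, 2, 5, 9])
recovered = [invert(table, j) for j in (2, 5, 9, 7)]
```

Each call to `invert` performs one binary search over the sorted array, so inverting a list of candidate indices costs one logarithmic-time lookup per candidate.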

We will call the scheme above SCHEME-1. Note that we have shown that SCHEME-1 works and that the following holds.

Lemma J.1. SCHEME-1 adds $O((k/\eta)\log(N/k))$ to the decoding time in Theorem 4.15 and needs $O(N \log N)$
bits of space.

We also need $O(\zeta^5\eta^2 \cdot k \cdot \log^2 N)$ bits of space to store the randomness needed to define $g$ (which we
need to store). However, this is subsumed by the $O(N \log N)$ space to store the array above.

Figure 1: Overview of the proof of Lemma 4.3 for the specific example of $N = 8$ and one level of recursion. The encoding
can be thought of as follows. The input vector $(x_0, \dots, x_7)$ is moved around by a random map $f$ (in this case $f$ is a permutation).
Then the shuffled vector is acted upon by one expander on level 0 and three expanders on level 1. (For clarity the shuffled vector
is copied next to the four expanders.) The green and red labels on the expander edges denote $+1$ and $-1$ weights respectively. At
the level 0 expander, the eight values are routed on the edges and are collected in the output (gray) nodes by adding the incoming
values (weighted by the corresponding $\pm 1$ label). At the level 1 expanders a similar process happens except the shuffled vector is
combined using the following code on the indices (which are represented in the vector next to the level 0 expander for convenience)
from Lemma 2.5: $C(b_0, b_1, b_2) = ((b_0, b_1), (b_1, b_2), (b_0, b_2))$. The left most level 1 expander corresponds to the first codeword
position, the middle one to the second position and the right one to the third position. Also the values corresponding to the indices
that map to the same symbol in $\{0, 1\}^2$ are added up together. For convenience, we have used the same colored arrows for the same
symbol in $\{0, 1\}^2$ for every level 1 expander. E.g., in the middle expander, the top left vertex corresponds to $(b_1, b_2) = (0, 0)$,
which means index 000 (which has value $x_1$) and index 100 (which has value $x_5$) get added up. The decoding process reverses the
encoding logic. We first identify the heavy hitters for the level 1 expanders using the identification algorithm from Lemma 4.13. These
lists are then combined for the $\{0, 1\}^3$ domain by the list recovery algorithm for $C$ from Lemma 2.5. The output is then used as
the set $S$ in the identification algorithm from Theorem 4.8 to get the locations of the heavy items in the shuffled vector. Finally, we
use the inversion procedure as outlined in Section 4.5.1 to obtain our final indices, which are the 2nd and the 6th positions. (For
clarity, the set of identified indices at every step is surrounded by a light blue curve.)



